December 25, 2017, CITI, Academia Sinica
Slides & photos are available now!!
December 25, 2017 (Monday)
10:00 - 17:30
Auditorium 122 at CITI, Academia Sinica
Learning Affinity via Spatial Propagation Networks
I will present recent results on learning affinity for low-level vision problems. We propose spatial propagation networks for learning the affinity matrix for vision tasks. We show that by constructing a row/column linear propagation model, the spatially varying transformation matrix exactly constitutes an affinity matrix that models dense, global pairwise relationships of an image. Specifically, we develop a three-way connection for the linear propagation model, which (a) formulates a sparse transformation matrix, where all elements can be outputs from a deep CNN, but (b) results in a dense affinity matrix that effectively models any task-specific pairwise similarity matrix. Instead of designing similarity kernels based on the image features of two points, we can directly output all the similarities in a purely data-driven manner. The spatial propagation network is a generic framework that can be applied to many affinity-related tasks, such as image matting, segmentation, and colorization, to name a few. Essentially, the model can learn semantically aware affinity values for high-level vision tasks owing to the powerful learning capability of deep CNNs. We validate the framework on the task of refining image segmentation boundaries. Experiments on the HELEN face parsing and PASCAL VOC-2012 semantic segmentation tasks show that the spatial propagation network provides a general, effective, and efficient solution for generating high-quality segmentation results.
When time allows, I will also preview our most recent results on temporal propagation networks and portrait rendering from a monocular camera.
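As a rough illustration of the row/column linear propagation model described above (not the authors' implementation), the sketch below performs a single left-to-right pass of a three-way connected propagation in NumPy. The function name and the convention of passing the per-pixel weights in as an array are assumptions of this sketch; in the paper, those weights would be predicted by a deep CNN.

```python
import numpy as np

def row_propagate(x, w):
    """One left-to-right pass of a three-way linear propagation (sketch).

    x : (H, W) input feature map (single channel for simplicity).
    w : (H, W, 3) per-pixel weights connecting pixel (i, t) to its three
        neighbors (i-1, i, i+1) in the previous column t-1. In the actual
        method these would come from a deep CNN; here they are an input.
    """
    H, W = x.shape
    h = np.zeros_like(x)
    h[:, 0] = x[:, 0]                      # first column: no history yet
    for t in range(1, W):
        for i in range(H):
            s = 0.0
            wsum = 0.0
            for k, j in enumerate((i - 1, i, i + 1)):
                if 0 <= j < H:
                    s += w[i, t, k] * h[j, t - 1]
                    wsum += w[i, t, k]
            # convex combination of the input and the propagated signal
            h[i, t] = (1.0 - wsum) * x[i, t] + s
    return h
```

Combining four such directional passes (left-to-right, right-to-left, top-to-bottom, bottom-to-top) is what yields the dense, global pairwise affinity discussed in the abstract.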
Learning Visual Reconstruction
Deep learning approaches have demonstrated impressive results in a wide variety of visual recognition tasks. These successes mainly result from the use of massive human-labeled datasets such as ImageNet. However, in visual reconstruction tasks - recovering 3D scene geometry, dense motion, material, surface normals, and illumination conditions from one or multiple images of a dynamic scene - large-scale ground-truth labels are often difficult or impossible to obtain. In this talk, I will give three examples of tackling visual reconstruction using learning-based approaches. To address the dataset problem, I will demonstrate how we can leverage simulation, reconstruction, and consistency as supervisory signals. I will close by listing several exciting open problems.
Video Analytics for AI City Smart Transportation
The prosperity of AI, deep learning, and IoT is making our world smarter and impacting our lives in every aspect. Among the many changing fronts, smart transportation represents the core of the smart city, sitting at a unique spot with strong technology readiness and momentum. While millions of traffic and street video cameras around the world capture data, far too little automated analysis exists to create value that supports decision making for traffic optimization, safety, and management. In this talk we will present recent advancements in video analytics technologies for (1) the detection and tracking of pedestrians, vehicles, and motorists for traffic analysis, and (2) behavior and event recognition methods that can automate decision-making support. With the rapid advancement of powerful GPU servers and GPU-enabled embedded platforms, automatic video analytics technologies are mature enough to improve traffic control, reduce congestion, prevent accidents, support surveillance, and upgrade transportation infrastructure, making our transit systems safer, smarter, and cheaper. We will also summarize results from two recent public contests, the IEEE SmartWorld NVIDIA AI City Challenge and the IEEE AVSS Workshop on Traffic and Street Surveillance for Safety and Security Challenge, and share thoughts on future developments.
Multi-Object Tracking for Long-Term Nursing Home Video Analysis
The safety and well-being of elderly people can benefit from continuous observation and care. However, such attention is lacking in many of the environments elderly people live in. We therefore propose to alleviate this problem through automated analysis of surveillance cameras set up in the elderly person's living environment, such as a nursing home. The natural first step toward automated analysis is the localization and tracking of each person in the environment, which can be formulated as a multi-object tracking problem. We present a manifold learning-based formulation of multi-object tracking and develop different optimization techniques to solve this problem. Furthermore, we present an unsupervised method for collecting person re-identification training data based on multi-view geometry, enabling us to learn deep features that further enhance tracking. Experiments on multi-camera tracking data sets show that our proposed method can perform tracking in complex indoor environments, and that the tracking results can produce a reasonable summarization of thousands of hours of nursing home surveillance video.
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos without semantic labels. We leverage temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations.
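To make the proxy task concrete, here is a minimal Python sketch of how a training example for sequence sorting might be constructed: sample frames in temporal order, apply a random permutation, and use the permutation index as the classification label. The function name is hypothetical, and details of the actual method (e.g., handling the forward/backward ambiguity of sequences) are omitted.

```python
import itertools
import random

def make_sorting_sample(frames, tuple_len=4, rng=random):
    """Create one (shuffled_frames, label) example for sequence sorting.

    frames : list of frames (any objects) in chronological order.
    The label indexes which permutation was applied, so a CNN can be
    trained to classify it. Hypothetical helper, illustrative only.
    """
    perms = list(itertools.permutations(range(tuple_len)))
    # sample tuple_len frame indices and keep them in temporal order
    idx = sorted(rng.sample(range(len(frames)), tuple_len))
    ordered = [frames[i] for i in idx]
    # pick a random permutation and shuffle the tuple accordingly
    label = rng.randrange(len(perms))
    shuffled = [ordered[p] for p in perms[label]]
    return shuffled, label
```

A network trained on such examples must implicitly learn temporal structure (e.g., motion continuity) to predict the label, which is the source of the learned representation.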
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., “dressing”) to the action (e.g., “mix yogurt”) that produced it. The key challenge is the inevitable visual-linguistic ambiguities arising from the changes in both the visual appearance and the referring expression of an entity in the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, where linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by training our model without supervision on more than two thousand unstructured cooking videos from YouTube, and show that our visual-linguistic model can substantially improve upon a state-of-the-art linguistic-only model for reference resolution in instructional videos.
Improved Bilinear Pooling with CNNs
Bilinear pooling with CNNs has been shown to be effective at fine-grained recognition, scene categorization, texture recognition, and visual question-answering tasks, among others. The resulting representation captures second-order statistics of convolutional features in a translationally invariant manner. We investigate various ways of normalizing these statistics to improve their representational power; in particular, we identify that matrix square-root normalization offers significant improvements when combined with element-wise square-root and l2 normalization. A common approach to computing the matrix square root involves SVD, for which GPU implementations are inefficient and the gradient computation is numerically unstable. In this talk, I will cover (1) the exact gradient computation of the matrix square root via solving a Lyapunov equation, which enables end-to-end fine-tuning of the second-order feature representation, and (2) an approximation based on Newton iterations for computing the matrix square root that is an order of magnitude faster than the SVD approach.
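The Newton-iteration idea can be sketched with the classic Newton-Schulz coupled iteration for the matrix square root, which uses only matrix multiplications and is therefore GPU-friendly. The exact variant, normalization, and iteration count used by the authors may differ; treat this NumPy version as an illustrative sketch for a symmetric positive-definite input.

```python
import numpy as np

def sqrtm_newton_schulz(A, num_iters=20):
    """Approximate the square root of an SPD matrix A (sketch).

    Coupled Newton-Schulz iteration:
        T = 0.5 * (3I - Z Y);  Y <- Y T;  Z <- T Z
    with Y0 = A / ||A||_F and Z0 = I. Pre-scaling by the Frobenius norm
    brings the eigenvalues into (0, 1], which guarantees convergence;
    Y then converges to sqrt(A / ||A||_F), so we undo the scaling at the end.
    """
    n = A.shape[0]
    norm = np.linalg.norm(A)          # Frobenius norm for pre-scaling
    Y = A / norm
    Z = np.eye(n)
    I = np.eye(n)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * np.sqrt(norm)          # undo the pre-scaling
```

Because every step is a matrix product, the whole computation batches well on a GPU and differentiates cleanly, which is the motivation for preferring it over SVD mentioned in the abstract.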
OpenPose: Hand, Face, and Body Keypoint Detection in Realtime
In this talk I will introduce "OpenPose", the first real-time multi-person system to jointly detect human body, hand, and facial keypoints on single images. This system was built on three key concepts: (1) Convolutional Pose Machines, a sequential prediction framework for learning rich implicit spatial models, (2) Part Affinity Fields, a novel representation that encodes the association between body parts and individuals, and (3) multi-view bootstrapping, a method to automatically generate high-quality keypoint labels for hands. The system was built by our team at Carnegie Mellon University. Before the release of OpenPose, our method won first place in the 2016 COCO keypoint localization challenge and the best demo award at ECCV 2016.
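As an illustration of how Part Affinity Fields score a candidate limb, the sketch below approximates the line integral of the field along the segment between two keypoint candidates. The function name and the simplifications (no score threshold or sample-count criterion) are assumptions of this sketch, not the released OpenPose code.

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Score a candidate limb between keypoints p1 and p2 (sketch).

    paf_x, paf_y : (H, W) maps giving the x/y components of the 2-D
                   vector field predicted for one limb type.
    p1, p2       : (x, y) keypoint coordinates.
    Returns the average dot product between the field and the limb
    direction, sampled along the segment (an approximate line integral).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    v = v / norm                      # unit vector of the candidate limb
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * (p2 - p1)).round().astype(int)
        # dot product between the field and the limb direction
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]
    return score / num_samples
```

A high score means the predicted field consistently points along the candidate limb, so the two keypoints likely belong to the same person; these scores then feed a bipartite matching step to assemble full skeletons.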
Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks
Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this talk, I will introduce the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that rely on bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using a robust Charbonnier loss function and achieve high-quality image reconstruction. Furthermore, we utilize recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters.
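The Charbonnier loss mentioned above is a differentiable relaxation of the L1 penalty, rho(d) = sqrt(d^2 + eps^2), which is less sensitive to outliers than L2. A minimal NumPy sketch (with an assumed value for the small constant eps) might look like:

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty, a smooth variant of L1: rho(d) = sqrt(d^2 + eps^2).

    Averaged over all pixels. The value of `eps` here is an assumption
    for illustration; near d = 0 the loss is smooth (unlike |d|), while
    for large |d| it behaves like L1.
    """
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))
```

In training, this loss would be applied at every pyramid level (deep supervision), comparing each predicted residual-refined image against the downsampled ground truth at that scale.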