Posted by Mohamad Ivan Fanany
This post summarizes and reviews a deep learning framework that predicts human pose from motion features: MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation.
- Human body pose recognition in video is a long-standing problem in computer vision with a wide range of applications.
- Rather than motion-based features, computer vision approaches tend to rely on appearance cues:
- Texture patches,
- Color histograms,
- Foreground silhouettes,
- Hand-crafted local features (such as histograms of oriented gradients (HoG))
- Psychophysical experiments have shown that motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.
- A convolutional neural network for articulated human pose estimation in videos, which incorporates both color and motion features.
- Significantly better performance than current state-of-the-art pose detection systems.
- Successfully incorporates motion-features to enhance the performance of pose-detection ‘in-the-wild’.
- Achieves close to real-time frame rates, making it suitable for wide variety of applications.
- A new dataset called FLIC-motion: the FLIC dataset augmented with ‘motion-features’ for each of the 5003 images collected from Hollywood movies.
- Body pose recognition remains a challenging problem due to:
- High dimensionality of the input data
- High variability of possible body poses.
- Previous work [4, 5]:
- Motion features have had little or no impact on pose inference.
- Adding high-order temporal connectivity to traditional models would most often lead to intractable inference.
- The proposed paper shows:
- Deep learning is able to successfully incorporate motion features.
- Deep learning is able to out-perform existing state-of-the-art techniques.
- Using motion features alone, the proposed method outperforms [6, 7, 8]
- These results further strengthen the claim that the information coded in motion features is valuable and should be used when available.
Geometric Model Based Tracking:
- Articulated tracking systems (1983 – 2007):
- The earliest (1983), by Hogg, used edge features and a simple cylinder-based body model.
- More recent (1995 – 2001) [10,11, 12,13, 14, 15,16]:
- The models used in these systems were explicit 2D or 3D jointed geometric models.
- Most systems had to be hand-initialized, with few exceptions.
- Focused on incrementally updating pose parameters from one frame to the next.
- More recent (2007 – 2010):
- A complete survey of this era is available in the literature.
- Most recently (2008 – 2011), techniques to create very high-resolution animations of detailed body and cloth deformations [20, 21, 22].
- Key difference of the proposed approach: dealing with single view videos in unconstrained environments.
Statistical Based Recognition:
- No explicit geometric model:
- The earliest (1995) used oriented angle histograms to recognize hand configurations.
- Shape-context edge-based histograms from the human body [28, 29]
- Shape-context from silhouette features.
- Learning a parameter-sensitive hash function to perform example-based pose estimation.
- Extract, learn, or reason over entire body features, using a combination of local detectors and structural reasoning:
- “Pictorial Structures” 
- Matching pictorial structures efficiently to images using ‘Deformable Part Models’ (DPM) in 2008.
- Many algorithms use DPM for creating the body-part unary distribution [36, 6, 7, 37], with spatial models incorporating body-part relationship priors.
- A cascade of body-part detectors to obtain more discriminative templates.
- Almost all of the best-performing algorithms since have been built solely on HoG and DPM for local evidence, combined with more sophisticated spatial models.
- Pishchulin proposes a model that augments the DPM unaries with Poselet-conditioned priors.
- Sapp and Taskar propose a model that clusters images in pose-space and then finds the mode which best describes the input image.
- The pose of this mode then acts as a strong spatial prior, whereas the local evidence is again based on HoG and gradient features.
- The Poselets approach.
- The Armlets approach:
- Incorporates edges, contours, and color histograms in addition to HoG features.
- Employs a semi-global classifier for part configuration.
- Shows good performance on real-world data.
- However, results are reported only on arms.
- The major drawback of all these approaches is that both the local evidence and the global structure are hand-crafted.
- Key difference of the proposed method: Jointly learn both the local features and the global structure using a multi-resolution convolutional network.
- An ensemble of random trees to perform per-pixel labeling of body parts in depth images.
- To reduce overall system latency and avoid repeated false detections, they focus on pose inference using only a single depth image.
- The proposed approach:
- Extends the single-frame input to at least two frames, which considerably improves pose inference.
- Takes unconstrained RGB images as input rather than depth.
Pose Detection Using Image Sequences:
Deep Learning based Techniques:
- State-of-the-art performance on many vision tasks using deep learning [43, 44, 45, 46, 47, 48].
- [49, 50, 51] also apply neural networks for pose recognition.
- Toshev et al. show better-than-state-of-the-art performance on the ‘FLIC’ and ‘LSP’ datasets.
- In contrast to Toshev et al., the proposed work introduces a translation-invariant model which improves upon the previous method, especially in the high-precision region.
Body-Part Detection Model
- The paper proposes a Convolutional Network (ConvNet) architecture for estimating the 2D location of human joints in video.
- The input to the network is an RGB image and a set of motion features.
- Investigate a wide variety of motion feature formulations.
- Introduce a simple Spatial-Model to solve a specific sub-problem associated with evaluation of our model on the FLIC-motion dataset.
- Aims to:
- Approximate the true motion-field: the perspective projection of the 3D velocity field of moving surfaces,
- Incorporate features that are representative of the true motion-field,
- Exploit motion as a cue for body-part localization.
- Evaluate and analyze four motion features which fall under two broad categories:
- Using simple derivatives of the RGB video frames
- Using optical flow features.
- For each RGB image pair, the paper proposes the following features:
- RGB Image pair
- RGB image and an RGB difference image
- Optical-flow vectors
- Optical-flow magnitude
- The RGB image pair:
- The simplest way of incorporating the relative motion information between the two frames.
- Suffers from a lot of redundancy (e.g., when there is no camera movement).
- Extremely high dimensional.
- Not obvious what changes in this high dimensional input space are relevant temporal information and what changes are due to noise or camera motion.
- A simple modification to image-pair representation is to use a difference image:
- Reformulates the RGB input so that the algorithm sees directly the pixel locations where high energy corresponds to motion
- Alternatively the network would have to do this implicitly on the image pair.
- A more sophisticated representation is optical-flow:
- High-quality approximation of the true motion-field,
- Inferring optical flow from the raw RGB input would be nontrivial for the network to estimate,
- So optical-flow calculation is performed as a pre-processing step (at the cost of greater computational complexity).
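As a rough sketch, the four feature formulations above could be assembled as follows (a numpy-only illustration, not the paper's code; the optical-flow field itself is assumed to be precomputed by an external dense-flow method):

```python
import numpy as np

def motion_features(img_t, img_t1, flow=None, mode="rgb_pair"):
    """Build one of the four motion-feature inputs for a frame pair.

    img_t, img_t1 : H x W x 3 float arrays (consecutive RGB frames)
    flow          : H x W x 2 optical-flow field, assumed precomputed
    """
    if mode == "rgb_pair":
        # simplest option: stack both frames along the channel axis
        return np.concatenate([img_t, img_t1], axis=-1)           # H x W x 6
    if mode == "rgb_diff":
        # frame plus temporal difference; high energy marks motion
        return np.concatenate([img_t, img_t1 - img_t], axis=-1)   # H x W x 6
    if mode == "flow_vectors":
        return np.concatenate([img_t, flow], axis=-1)             # H x W x 5
    if mode == "flow_magnitude":
        mag = np.linalg.norm(flow, axis=-1, keepdims=True)
        return np.concatenate([img_t, mag], axis=-1)              # H x W x 4
    raise ValueError(f"unknown mode: {mode}")
```

The channel counts per mode make the dimensionality trade-off above concrete: the raw pair doubles the input channels, while the flow-magnitude variant adds only one.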
- The paper proposes a new dataset called FLIC-motion.
- It is comprised of:
- The original FLIC dataset of 5003 labeled RGB images collected from 30 Hollywood movies,
- 1016 images from the original FLIC are held out as a test set, augmented with the aforementioned motion features.
- Experiments with several frame-difference lengths between the image pair.
- Warps one image of the pair using the inverse of the best-fitting projection between the pair, to remove camera motion.
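A minimal numpy sketch of this camera-motion removal step, assuming the best-fitting 3×3 projection `H` between the pair has already been estimated elsewhere (e.g., from matched keypoints), and using nearest-neighbour sampling:

```python
import numpy as np

def warp_to_reference(img, H):
    """Align `img` to a reference frame via inverse mapping: for each
    output pixel, look up the source pixel that H maps it to. If H is
    the camera motion from the reference frame into `img`, this warps
    `img` back onto the reference frame (i.e., a warp by H's inverse
    in the forward direction)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous
    src = H @ pts
    src = (src[:2] / src[2]).round().astype(int)              # dehomogenize
    sx = np.clip(src[0], 0, w - 1).reshape(h, w)
    sy = np.clip(src[1], 0, h - 1).reshape(h, w)
    return img[sy, sx]
```

With camera motion cancelled this way, the remaining difference energy between the pair comes mostly from the moving body parts.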
Convolutional neural network:
- Recent work [49, 50] has shown ConvNet architectures are well suited for the task of human body pose detection.
- Due to the availability of modern Graphics Processing Units (GPUs), we can perform Forward Propagation (FPROP) of deep ConvNet architectures at interactive frame-rates.
- Similarly, the pose detection model can be realized as a deep ConvNet architecture.
- Input: a 3D tensor containing an RGB image and its corresponding motion features.
- Output: a 3D tensor containing response-maps, with one response-map for each joint.
- Each response-map describes the per-pixel energy for the presence of the corresponding joint at that pixel location.
- Based on a sliding-window architecture.
- The input patches are first normalized using:
- Local Contrast Normalization (LCN) for the RGB channels,
- A new normalization for motion features called Local Motion Normalization (LMN):
- Local subtraction with the response from a Gaussian kernel with large standard deviation followed by a divisive normalization.
- It removes some unwanted background camera motion as well as normalizing the local intensity of motion
- Helps improve network generalization for motions of varying velocity but with similar pose.
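A small numpy sketch of an LMN-style normalization as described above (local mean subtraction with a large-sigma Gaussian, followed by divisive normalization); the `sigma` and `eps` values here are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def gaussian_blur(x, sigma):
    # separable Gaussian blur with reflect padding (2-D input)
    r = int(3 * sigma)
    k = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    k /= k.sum()
    p = np.pad(x, r, mode="reflect")
    p = np.apply_along_axis(np.convolve, 0, p, k, "valid")   # filter rows
    return np.apply_along_axis(np.convolve, 1, p, k, "valid")  # then cols

def local_motion_normalization(m, sigma=5.0, eps=1e-4):
    # subtract the local mean (Gaussian with large standard deviation),
    # then divide by the local magnitude (divisive normalization)
    centered = m - gaussian_blur(m, sigma)
    return centered / (np.sqrt(gaussian_blur(centered ** 2, sigma)) + eps)
```

The subtraction suppresses slowly varying background motion (e.g., residual camera motion), while the division rescales fast and slow body motions of similar pose toward a common range.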
- Prior to processing through the convolution stages, the normalized motion channels are concatenated along the feature dimension with the normalized RGB channels.
- The resulting tensor is processed through 3 stages of convolution, each followed by a rectified linear unit (ReLU) layer.
- The output of the last convolution stage is then passed to a three stage fully-connected neural network.
- The network is then applied to all 64 × 64 sub-windows of the image, stepped every 4 pixels horizontally and vertically to produce a dense response-map output, one for each joint.
- The major advantage: the learned detector is translation invariant by construction.
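The sliding-window idea can be illustrated with a toy numpy loop; here `detector` is a stand-in for the trained ConvNet's window classifier, and in practice the network is applied fully convolutionally so that overlapping windows share computation:

```python
import numpy as np

def dense_response_map(image, detector, win=64, stride=4):
    """Apply the same fixed-window detector at every (stride-spaced)
    location, producing one response value per joint per window.
    Because the identical detector runs at every position, the result
    is translation invariant by construction."""
    h, w = image.shape[:2]
    ys = range(0, h - win + 1, stride)
    xs = range(0, w - win + 1, stride)
    return np.array([[detector(image[y:y + win, x:x + win])
                      for x in xs] for y in ys])   # (rows, cols, n_joints)
```

Each output plane is one joint's response map: the per-location energy for that joint's presence, exactly as described above.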
Simple Spatial Model
- The test images in FLIC-motion may contain multiple people; however, only a single actor per frame is labeled in the test set.
- A rough torso location of the labeled person is provided at test time to help locate the “correct” person.
- Incorporate the rough torso location information by means of a simple and efficient Spatial-Model.
- The inclusion of this stage has two major advantages:
- The correct feature activation from the Part-Detector output is selected for the person for whom a ground-truth label was annotated.
- Since the joint locations of each part are constrained in proximity to the single ground-truth torso location, the connectivity between joints is also (indirectly) constrained, enforcing that inferred poses are anatomically viable.
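One simple way to realize such a spatial model is to reweight each joint's response map with a prior centred on the given torso location; the Gaussian form and `sigma` below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def apply_torso_prior(resp_maps, torso_xy, sigma=20.0):
    """resp_maps: (n_joints, H, W) part-detector outputs.
    Multiply each map by a Gaussian prior centred on the rough torso
    location, then pick the per-joint argmax. Peaks belonging to
    other, unlabeled people are suppressed by the prior."""
    n_joints, h, w = resp_maps.shape
    ys, xs = np.mgrid[0:h, 0:w]
    prior = np.exp(-((xs - torso_xy[0]) ** 2 +
                     (ys - torso_xy[1]) ** 2) / (2 * sigma ** 2))
    masked = resp_maps * prior
    coords = [np.unravel_index(m.argmax(), m.shape) for m in masked]
    return np.array(coords)   # (n_joints, 2) as (row, col)
```

This selects the activation belonging to the annotated person and, as a side effect, keeps all inferred joints near one torso.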
- Training time for the model on the FLIC-motion dataset (3957 training set images, 1016 test set images) is approximately 12 hours, and FPROP of a single image takes approximately 50 ms (on a 12-core workstation with an NVIDIA Titan GPU).
- For the proposed models that use optical flow as a motion feature input, the most expensive part of our pipeline is the optical flow calculation, which takes approximately 1.89s per image pair.
- Plan to investigate real-time flow estimations in the future.
Comparison with Other Techniques
- Compares the performance of the proposed system with other state-of-the-art models on the FLIC dataset for the elbow and wrist joints:
- The proposed detector is able to significantly outperform all prior techniques on this challenging dataset. Note that using only motion features already outperforms [6, 7, 8].
- Using only motion features is less accurate than using a combination of motion features and RGB images, especially in the high accuracy region. This is because fine details such as eyes and noses are missing in motion features.
- Toshev et al. suffer from inaccuracy in the high-precision region, which the authors attribute to inefficient direct regression of pose vectors from images.
- MODEC, Eichner et al., and Sapp et al. build on hand-crafted HoG features, so they all suffer from the limitations of HoG (e.g., discarding color information).
- Jain et al. do not use multi-scale information and evaluate their model in a sliding-window fashion, whereas the proposed method uses a ‘one-shot’ approach.
- This paper provides a comprehensive and systematic set of references to the human pose estimation literature.
- The new idea is the use of motion features for pose estimation, which, combined with appearance features, delivers the current best performance.
- The estimated pose consists of the 2D locations of human joints.
- Some questions come up after reading the paper:
- How can this be applied to 3D pose estimation?
- How can it be integrated with 3D motion-sensor estimation, such as the Kinect, for game applications?