Review of a Deep Learning Framework that Predicts How We Pose from Motion

Posted by Mohamad Ivan Fanany

This post summarizes and reviews a deep learning framework that predicts human pose using motion features: MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation.


  • Human body pose recognition in video is a long-standing problem in computer vision with a wide range of applications.
  • Rather than motion-based features, computer vision approaches tend to rely on appearance cues:
    • Texture patches,
    • Edges,
    • Color histograms,
    • Foreground silhouettes,
    • Hand-crafted local features (such as histogram of gradients (HoG) [2])
  • Psychophysical experiments [3] have shown that motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.


  • A convolutional neural network for articulated human pose estimation in videos, which incorporates both color and motion features.
  • Significantly better performance than current state-of-the-art pose detection systems.
  • Successfully incorporates motion-features to enhance the performance of pose-detection ‘in-the-wild’.
  • Achieves close to real-time frame rates, making it suitable for a wide variety of applications.
  • A new dataset called FLIC-motion: the FLIC dataset [1] augmented with ‘motion-features’ for each of the 5003 images collected from Hollywood movies.


  • Body pose recognition remains a challenging problem due to:
    • High dimensionality of the input data
    • High variability of possible body poses.

Previous Works

  • Previous work [4, 5]:
    • Motion features have had little or no impact on pose inference.
    • Adding high-order temporal connectivity to traditional models would most often lead to intractable inference.
  • The proposed paper shows:
    • Deep learning is able to successfully incorporate motion features.
    • Deep learning is able to out-perform existing state-of-the-art techniques.
    • Using motion features alone, the proposed method outperforms [6, 7, 8].
    • These results further strengthen the claim that the information encoded in motion features is valuable and should be used when available.

Geometric Model Based Tracking:

  • Articulated tracking systems (1983 – 2007):
    • The earliest (1983): Hogg [9] used edge features and a simple cylinder-based body model.
    • More recent (1995 – 2001) [10, 11, 12, 13, 14, 15, 16]:
      • The models used in these systems were explicit 2D or 3D jointed geometric models.
      • Most systems had to be hand-initialized (except [12])
      • Focused on incrementally updating pose parameters from one frame to the next.
    • More recent (2007 – 2010):
      • More complex examples come from the HumanEva dataset competitions [17]
      • Use video or higher-resolution shape models such as SCAPE [18] and extensions.
    • Complete survey of this era [19].
  • Most recently (2008 – 2011), techniques to create very high-resolution animations of detailed body and cloth deformations [20, 21, 22].
  • Key difference of the proposed approach: dealing with single view videos in unconstrained environments.

Statistical Based Recognition:

  • No explicit geometric model:
  • The earliest (1995) [23] used oriented angle histograms to recognize hand configurations.
    • This was the precursor for
      • The bag-of-features,
      • SIFT [24],
      • STIP [25],
      • HoG, and Histogram of Flow (HoF) [26]
      • Dalal and Triggs in 2005 [27].
  • Shape-context edge-based histograms from the human body [28, 29]
  • Shape-context from silhouette features [30].
  • Learn a parameter sensitive hash function to perform example-based pose estimation [31].
  • Extract, learn, or reason over entire body features, using a combination of local detectors and structural reasoning:
    • Coarse tracking [32]
    • Person-dependent tracking [33]
  • “Pictorial Structures” [34]
  • Matching pictorial structures efficiently to images using ‘Deformable Part Models’ (DPM) in [35] in 2008.
  • Many algorithms use DPM for creating the body part unary distribution [36, 6, 7, 37] with spatial-models incorporating body-part relationship priors.
  • A cascade of body part detectors to obtain more discriminative templates [38].
  • Almost all of the best-performing algorithms since then have built solely on HoG and DPM for local evidence, combined with more sophisticated spatial models.
  • Pishchulin [39] proposes a model that augments the DPM unaries with Poselet conditioned [40] priors.
  • Sapp and Taskar [1] propose a model where they cluster images in the pose space and then find the mode which best describes the input image.
  • The pose of this mode then acts as a strong spatial prior, whereas the local evidence is again based on HoG and gradient features.
  • Poselets approach [40]
  • The Armlets approach [41]:
    • Incorporates edges, contours, and color histograms in addition to the HoG features.
    • Employ a semi-global classifier for part configuration
    • Show good performance on real-world data.
    • They only show their results on arms.
  • The major drawback of all these approaches is that both the local evidence and the global structure are hand-crafted.
  • Key difference of the proposed method: Jointly learn both the local features and the global structure using a multi-resolution convolutional network.
  • An ensemble of random trees to perform per-pixel labeling of body parts in depth images [42].
    • To reduce overall system latency and avoid repeated false detections, they focus on pose inference using only a single depth image.
  • The proposed approach:
    • Extends the single-frame requirement to at least 2 frames (which considerably improves pose inference)
    • The input is unconstrained RGB images rather than depth.

Pose Detection Using Image Sequences:

Deep Learning based Techniques:

  • State-of-the-art performance on many vision tasks using deep learning [43, 44, 45, 46, 47, 48].
  • [49, 50, 51] also apply neural networks for pose recognition.
  • Toshev et al. [49] show better than state-of-the-art performance on the ‘FLIC’ and ‘LSP’ [52] datasets.
  • In contrast to Toshev et al., the proposed work introduces a translation-invariant model which improves upon the previous method, especially in the high-precision region.

Body-Part Detection Model

  • The paper proposes a Convolutional Network (ConvNet) architecture for estimating the 2D location of human joints in video.
    • The input to the network is an RGB image and a set of motion features.
    • Investigates a wide variety of motion feature formulations.
    • Introduces a simple Spatial-Model to solve a specific sub-problem associated with evaluating the model on the FLIC-motion dataset.

Motion Features

  • Aims:
    • The true motion-field: the perspective projection of the 3D velocity-field of moving surfaces
    • Incorporate features that are representative of the true motion-field
    • Exploit motion as a cue for body part localization.
  • Evaluate and analyze four motion features which fall under two broad categories:
    • Using simple derivatives of the RGB video frames
    • Using optical flow features.
    • For each RGB image pair, the paper proposes the following features:
      • RGB Image pair
      • RGB image and an RGB difference image
      • Optical-flow vectors
      • Optical-flow magnitude
  • The RGB image pair:
    • The simplest way of incorporating the relative motion information between the two frames.
    • Suffers from a lot of redundancy (e.g. when there is no camera movement)
    • Extremely high dimensional.
    • Not obvious what changes in this high dimensional input space are relevant temporal information and what changes are due to noise or camera motion.
  • A simple modification to image-pair representation is to use a difference image:
    • Reformulates the RGB input so that the algorithm directly sees the pixel locations where high energy corresponds to motion.
    • Otherwise, the network would have to compute this implicitly from the image pair.
  • A more sophisticated representation is optical-flow:
    • High-quality approximation of the true motion-field,
    • Inferring optical-flow from the raw RGB input would be nontrivial for the network,
    • So the optical-flow calculation is performed as a pre-processing step (at the cost of greater computational complexity).
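
The two simplest feature families above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `rgb_difference` and `flow_magnitude`, the toy frames, and the constant flow field are all hypothetical, and the flow field is supplied by hand rather than computed by an optical-flow algorithm.

```python
import numpy as np


def rgb_difference(frame_a, frame_b):
    """Signed per-pixel RGB difference between two frames, as float32."""
    return frame_b.astype(np.float32) - frame_a.astype(np.float32)


def flow_magnitude(flow):
    """Per-pixel magnitude of an (H, W, 2) optical-flow field."""
    return np.sqrt(np.sum(flow.astype(np.float32) ** 2, axis=-1))


# Toy data (hypothetical): a bright 2x2 square moves one pixel to the right.
frame_a = np.zeros((6, 6, 3), dtype=np.uint8)
frame_b = np.zeros((6, 6, 3), dtype=np.uint8)
frame_a[2:4, 1:3] = 255
frame_b[2:4, 2:4] = 255

# "RGB image and an RGB difference image": stack both along the channel axis.
diff = rgb_difference(frame_a, frame_b)
rgb_plus_diff = np.concatenate([frame_a.astype(np.float32), diff], axis=-1)

# "Optical-flow magnitude": here the flow vectors are a made-up constant
# field (dx, dy) = (3, 4); a real pipeline would estimate them from the pair.
flow = np.zeros((6, 6, 2), dtype=np.float32)
flow[..., 0] = 3.0
flow[..., 1] = 4.0
magnitude = flow_magnitude(flow)
```

The difference image is non-zero only where the square left or entered a pixel, which is the "high energy corresponds to motion" property described above; the pixels where the square overlaps itself in both frames cancel out.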

FLIC-motion dataset:

  • The paper proposes a new dataset called FLIC-motion.
  • It comprises:
    • The original FLIC dataset of 5003 labeled RGB images collected from 30 Hollywood movies, each augmented with the aforementioned motion features,
    • Of these, 1016 images are held out as a test set.
  • Experiments with several frame-difference lengths between the images in a pair.
  • Warps one image of the pair using the inverse of the best-fitting projection between the two images to remove camera motion.
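
The camera-motion removal step can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function name `align_to_reference`, the matrix convention (a 3x3 projection mapping reference-frame pixel coordinates to second-frame coordinates), and nearest-neighbour sampling are assumptions, and the projection is given by hand rather than estimated from the images.

```python
import numpy as np


def align_to_reference(second, ref_to_second):
    """Resample `second` into the reference frame (nearest neighbour).

    `ref_to_second` is an assumed 3x3 matrix mapping homogeneous
    reference-frame pixel coordinates (x, y, 1) to coordinates in `second`.
    Sampling `second` at the mapped positions is equivalent to
    forward-warping it by the inverse of that matrix. Pixels that map
    outside `second` are set to 0.
    """
    height, width = second.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    mapped = ref_to_second @ coords
    src_x = np.rint(mapped[0] / mapped[2]).astype(int)
    src_y = np.rint(mapped[1] / mapped[2]).astype(int)
    valid = (src_x >= 0) & (src_x < width) & (src_y >= 0) & (src_y < height)

    source = second.reshape(height * width, -1)
    aligned = np.zeros_like(source)
    aligned[valid] = source[src_y[valid] * width + src_x[valid]]
    return aligned.reshape(second.shape)


# Toy example (hypothetical): a camera pan shifts the whole scene 2 px right.
reference = np.arange(30, dtype=np.float32).reshape(5, 6)
second = np.zeros_like(reference)
second[:, 2:] = reference[:, :-2]

pan = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
aligned = align_to_reference(second, pan)
residual = aligned[:, :4] - reference[:, :4]  # zero where the frames overlap
```

After alignment, a difference image or flow field computed between `reference` and `aligned` would contain only motion that the global projection does not explain, such as a moving person.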

Convolutional neural network:

  • Recent work [49, 50] has shown ConvNet architectures are well suited for the task of human body pose detection.
  • Due to the availability of modern Graphics Processing Units (GPUs), we can perform Forward Propagation (FPROP) of deep ConvNet architectures at interactive frame-rates.
  • Similarly, the pose detection model can be realized as a deep ConvNet architecture.
    • Input: a 3D tensor containing an RGB image and its corresponding motion features.
    • Output: a 3D tensor containing response-maps, with one response-map for each joint.
    • Each response-map describes the per-pixel energy for the presence of the corresponding joint at that pixel location.
    • Based on a sliding-window architecture.
    • The input patches are first normalized using:
      • Local Contrast Normalization (LCN [53]) for the RGB channels
      • A new normalization for motion features that is called Local Motion Normalization (LMN)
        • Local subtraction with the response from a Gaussian kernel with large standard deviation followed by a divisive normalization.
        • It removes some unwanted background camera motion as well as normalizing the local intensity of motion
        • Helps improve network generalization for motions of varying velocity but with similar pose.
    • Prior to processing through the convolution stages, the normalized motion channels are concatenated along the feature dimension with the normalized RGB channels.
    • The resulting tensor is processed through 3 stages of convolution:
      • Rectified linear units (ReLU)
      • Maxpooling
      • A single ReLU layer.
    • The output of the last convolution stage is then passed to a three stage fully-connected neural network.
    • The network is then applied to all 64 × 64 sub-windows of the image, stepped every 4 pixels horizontally and vertically to produce a dense response-map output, one for each joint.
    • The major advantage: the learned detector is translation invariant by construction.
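
Two steps from the list above lend themselves to a short sketch: Local Motion Normalization and the size of the dense response map produced by the sliding window. The code below is a rough illustration, not the paper's implementation. The function names, the Gaussian width `sigma`, the stabilizing constant `eps`, the edge padding, and the 320 × 240 frame size are all assumptions; only the 64 × 64 window and the 4-pixel step come from the text.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(channel, sigma):
    """Separable Gaussian blur of a 2D array, with edge padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(channel, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def local_motion_normalization(channel, sigma=4.0, eps=1e-3):
    """Subtract a wide Gaussian response, then divide by local energy."""
    channel = channel.astype(np.float64)
    centered = channel - gaussian_blur(channel, sigma)
    local_energy = np.sqrt(gaussian_blur(centered ** 2, sigma))
    return centered / np.maximum(local_energy, eps)


def response_map_shape(height, width, window=64, stride=4):
    """Number of window positions (rows, cols) for a dense sliding window."""
    return ((height - window) // stride + 1, (width - window) // stride + 1)


# A uniform motion channel (e.g. residual camera pan) is normalized away.
uniform = np.full((24, 24), 5.0)
uniform_out = local_motion_normalization(uniform)

# The same motion pattern at three times the speed gives the same output.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(24, 24))
slow = local_motion_normalization(pattern)
fast = local_motion_normalization(3.0 * pattern)

# Response-map size for a hypothetical 320 x 240 input frame.
rows, cols = response_map_shape(height=240, width=320)
```

The uniform and scaled examples correspond to the two stated benefits of LMN: removing unwanted background motion and making the input less sensitive to motion speed. Because every window position is scored by the same detector, each location in the response map is treated identically, which is what makes the detector translation invariant.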

Simple Spatial Model

  • The test images in FLIC-motion may contain multiple people, however, only a single actor per frame is labeled in the test set.
  • A rough torso location of the labeled person is provided at test time to help locate the “correct” person.
  • Incorporate the rough torso location information by means of a simple and efficient Spatial-Model.
  • The inclusion of this stage has two major advantages:
    • The correct feature activation from the Part-Detector output is selected for the person for whom a ground-truth label was annotated.
    • Because the joint locations of each part are constrained to lie near the single ground-truth torso location, the connectivity between joints is also (indirectly) constrained, which enforces that inferred poses are anatomically viable.
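
The idea of using the torso location to pick the right person can be sketched as follows. This is a stand-in, not the paper's Spatial-Model, whose exact formulation is not described in this summary: the sketch simply weights a joint's response map with a Gaussian prior centred on the torso and takes the strongest remaining response. The function name `select_joint`, the Gaussian form of the prior, and its width `sigma` are assumptions, and the response map is assumed to be non-negative.

```python
import numpy as np


def select_joint(response_map, torso_xy, sigma=20.0):
    """Return (x, y) of the peak response after torso-centred weighting."""
    height, width = response_map.shape
    ys, xs = np.mgrid[0:height, 0:width]
    torso_x, torso_y = torso_xy
    squared_dist = (xs - torso_x) ** 2 + (ys - torso_y) ** 2
    prior = np.exp(-squared_dist / (2.0 * sigma ** 2))
    weighted = response_map * prior
    y, x = np.unravel_index(np.argmax(weighted), weighted.shape)
    return int(x), int(y)


# Toy response map (hypothetical) for one joint with two people in frame.
response = np.zeros((100, 100))
response[20, 80] = 1.0   # strong response on an unlabeled person at (80, 20)
response[60, 30] = 0.8   # weaker response on the labeled person at (30, 60)

raw_y, raw_x = np.unravel_index(np.argmax(response), response.shape)
joint_xy = select_joint(response, torso_xy=(25, 50))
```

Without the prior, the global maximum lands on the unlabeled person; with it, the response near the given torso wins even though it is weaker.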


  • Training time for the model on the FLIC-motion dataset (3957 training set images, 1016 test set images) is approximately 12 hours, and FPROP of a single image takes approximately 50 ms (on a 12-core workstation with an NVIDIA Titan GPU).
  • For the proposed models that use optical flow as a motion feature input, the most expensive part of the pipeline is the optical flow calculation, which takes approximately 1.89 s per image pair.
  • The authors plan to investigate real-time flow estimation in the future.
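
To put the timings above in perspective, the following back-of-the-envelope calculation converts them into frame rates. It assumes the flow computation and the forward pass run strictly one after the other with no overlap or batching, which the summary does not state; the variable names are illustrative.

```python
FPROP_SECONDS = 0.050  # forward pass per image, as reported above
FLOW_SECONDS = 1.89    # optical-flow computation per image pair, as reported


def frames_per_second(seconds_per_frame):
    """Throughput for a pipeline that processes one frame at a time."""
    return 1.0 / seconds_per_frame


fps_without_flow = frames_per_second(FPROP_SECONDS)
fps_with_flow = frames_per_second(FPROP_SECONDS + FLOW_SECONDS)
flow_share = FLOW_SECONDS / (FPROP_SECONDS + FLOW_SECONDS)

print(f"without flow: {fps_without_flow:.1f} fps")
print(f"with flow:    {fps_with_flow:.2f} fps")
print(f"flow share of per-frame time: {flow_share:.1%}")
```

Under these assumptions the network alone runs at about 20 frames per second, while the optical-flow variants drop to roughly half a frame per second, with flow estimation accounting for about 97% of the per-frame time. This is why the near real-time claim applies to the variants that do not depend on optical flow.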

Comparison with Other Techniques

  • The paper compares the performance of its system with other state-of-the-art models on the FLIC dataset for the elbow and wrist joints:
    • The proposed detector is able to significantly outperform all prior techniques on this challenging dataset. Note that using only motion features already outperforms [6, 7, 8].
    • Using only motion features is less accurate than using a combination of motion features and RGB images, especially in the high accuracy region. This is because fine details such as eyes and noses are missing in motion features.
    • Toshev et al. [49] suffer from inaccuracy in the high-precision region, which the authors attribute to inefficient direct regression of pose vectors from images.
    • MODEC [1], Eichner et al. [6] and Sapp et al. [8] build on hand-crafted HoG features. They all suffer from the limitations of HoG (e.g. they all discard color information).
    • Jain et al. [50] do not use multi-scale information and evaluate their model in a sliding window fashion, whereas the proposed method uses the ‘one-shot’ approach.

My Review

  • This paper provides a comprehensive and systematic list of references to the literature on human pose estimation.
  • The new idea is the use of motion features for pose estimation, which, when combined with appearance features, delivers the current best performance.
  • The estimated pose is 2D location of human joints.
  • Some questions come up after reading the paper:
    • How would this be applied to 3D pose estimation?
    • How could this be integrated with 3D motion sensors such as Kinect for game applications?
