A Review of a Deep Learning Paper that Reveals the Importance of Big Data

Posted by Mohamad Ivan Fanany

This writing summarizes and reviews the paper that reveals the importance of Big Data for Deep Learning: ImageNet Classification with Deep Convolutional Neural Networks.


  • Current approaches to object recognition make essential use of machine learning methods.
  • Ways to improve recognition performance:
    • Collect larger datasets
    • Learn more powerful models,
    • Use better techniques for preventing over-fitting.
  • Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).

Key Ideas:

  • Simple recognition tasks can be solved quite well with datasets of tens of thousands of images:
    • If they are augmented with label-preserving transformations.
    • Current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
  • Objects in realistic settings exhibit considerable variability:
    • It is necessary to use much larger training sets.
    • The shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]).
  • It has recently become possible to collect labeled datasets with millions of images.
    • LabelMe [23]: hundreds of thousands of fully-segmented images,
    • ImageNet [6]: over 15 million labeled high-resolution images in over 22,000 categories.
  • The complexity of the object recognition task is so immense that the problem cannot be specified even by a dataset as large as ImageNet.
  • Learning thousands of objects from millions of images needs a model with:
    • Large learning capacity
    • Lots of prior knowledge (to compensate for all data we don’t have)
  • Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]:
    • Their capacity can be controlled by varying their depth and breadth,
    • Make strong and mostly correct assumptions about the nature of images:
      • Stationarity of statistics
      • Locality of pixel dependencies
    • Have far fewer connections and parameters than standard feedforward neural networks with similarly-sized layers.
  • For high-resolution images, large-scale CNNs are still prohibitively expensive to train.
  • GPUs + a highly-optimized implementation of 2D convolution = feasible training of large CNNs.
  • Use non-saturating neurons and a very efficient GPU implementation of the convolution operation.
  • On overfitting problem:
    • Datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
    • Due to large size of the network, however, overfitting is a significant problem, even with 1.2 million labeled training examples.
  • To reduce overfitting in the fully-connected layers, employ a regularization method called “dropout” that proved to be very effective.
  • Depth seems to be important: it is found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.
  • The results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
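The parameter-economy claim above (far fewer parameters than a similarly-sized fully-connected network) can be made concrete with a quick count; the layer sizes here are illustrative assumptions, not taken from the paper:

```python
# Mapping a 32x32x3 input to 16 feature maps of the same spatial size:
conv_params = 16 * (5 * 5 * 3 + 1)                 # 16 shared 5x5x3 kernels + biases
dense_params = (32 * 32 * 16) * (32 * 32 * 3 + 1)  # fully connected, same sizes

# conv: about 1.2 thousand weights vs. dense: about 50 million weights
```

Weight sharing is what collapses the count: every output location of a feature map reuses the same 5×5×3 kernel.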


  • A deep CNN to classify 1.2 million high-resolution images in the ImageNet LSVRC-2010 and LSVRC-2012 contest into 1000 different classes.
  • Achieved by far the best results ever reported on these datasets.
  • The implementation of the CNN is made publicly available:
    • A highly-optimized GPU implementation of 2D convolution and other inherent operations.
    • A number of new and unusual features which improve performance and reduce training time.
    • Several effective techniques for preventing overfitting.
  • The proposed final network contains:
    • Five convolutional layers
    • Three fully-connected layers
  • The network’s size is limited mainly by:
    • The amount of memory available on current GPUs
    • The amount of training time that we are willing to tolerate.
  • The proposed network takes between five and six days to train on two GTX 580 3GB GPUs.
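The five-convolutional/three-fully-connected structure can be sanity-checked against the paper's stated parameter count. The layer sizes below are the commonly cited AlexNet configuration (including the two-GPU kernel grouping), reconstructed here as an assumption rather than quoted from this summary:

```python
# Hedged reconstruction of the layer sizes: (number of kernels, fan-in per kernel)
conv = [
    (96,  11 * 11 * 3),   # conv1
    (256, 5 * 5 * 48),    # conv2 (grouped: each kernel sees half the maps)
    (384, 3 * 3 * 256),   # conv3 (takes input from all layer-2 maps)
    (384, 3 * 3 * 192),   # conv4 (grouped)
    (256, 3 * 3 * 192),   # conv5 (grouped)
]
fc = [(4096, 6 * 6 * 256), (4096, 4096), (1000, 4096)]

# one bias per output unit/kernel
total = sum(n * (fan_in + 1) for n, fan_in in conv + fc)
# total lands near the paper's "60 million parameters"
```

Note how the fully-connected layers dominate the count; the convolutional layers together hold only a few percent of the parameters, consistent with the depth-ablation observation above.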

The Dataset:

  • Over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
  • The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool.
  • Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held:
    • ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories.
    • In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
    • ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available.
  • Most of the experiments use ILSVRC-2010.
  • Some results using ILSVRC-2012 (no test set labels) are also reported
  • On ImageNet, it is customary to report two error rates: top-1 and top-5:
    • Top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
  • ImageNet consists of variable-resolution images.
  • The proposed system requires a constant input dimensionality:
    • Down-sampled the images to a fixed resolution of 256 × 256.
    • Given a rectangular image, first rescaled the image such that the shorter side was of length 256
    • Cropped out the central 256×256 patch from the resulting image.
    • Subtracted the mean activity over the training set from each pixel.
    • Trained the proposed network on the (centered) raw RGB values of the pixels.
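The preprocessing steps above might be sketched as follows; nearest-neighbor resizing is used here for self-containment, and the exact resampling method is an assumption:

```python
import numpy as np

def preprocess(img, size=256):
    """Rescale the shorter side to `size`, then center-crop a size x size
    patch (nearest-neighbor resampling is an assumed implementation detail)."""
    h, w, _ = img.shape
    scale = size / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # index maps that implement nearest-neighbor resizing
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    img = img[rows][:, cols]
    # central crop
    top, left = (nh - size) // 2, (nw - size) // 2
    return img[top:top + size, left:left + size]
```

In training, the per-pixel mean over the training set would then be subtracted from each crop before feeding the raw (centered) RGB values to the network.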


  • ReLU Nonlinearity:
    • The standard activation functions: tanh or sigmoid (saturating nonlinearities).
    • With gradient descent, tanh and sigmoid are much slower than the non-saturating nonlinearity f(x) = max(0, x).
    • Neurons with this non-saturating nonlinearity are called Rectified Linear Units (ReLUs) [20].
    • Deep CNNs with ReLUs train several times faster than their equivalents with tanh units.
    • On CIFAR-10, a model with ReLUs reaches 25% training error about 6 times faster than the same model with tanh units.
    • Jarrett et al. [11] claim that the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset.
    • Faster learning has a great influence on the performance of large models trained on large datasets.
  • Training on Multiple-GPUs:
    • A single GTX 580 GPU has only 3GB of memory.
    • GPU memory limits the maximum size of the networks that can be trained.
    • 1.2 million training examples are enough to train networks that are too big to fit on one GPU.
    • Spread the net across two GPUs.
    • Current GPUs are particularly well-suited to cross-GPU parallelization:
      • Ability to read from and write to one another’s memory directly without going through host machine memory.
      • Put half of the kernels (or neurons) on each GPU
        • GPU communicate only in certain layers.
          • kernels of layer 3 take input from all kernels in layer 2
          • kernels of layer 4 take input only from those kernels in layer 3 that reside on the same GPU.
      • Allows the amount of communication to be tuned precisely.
    • The architecture is similar to that of Cireşan et al. [5], except that the columns are not independent.
    • This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU.
    • The two-GPU net takes slightly less time to train than the one-GPU net.
  • Local Response Normalization:
    • ReLUs do not require input normalization to prevent them from saturating.
    • However, the paper still finds that the proposed local normalization scheme aids generalization.
    • The paper applied this normalization after applying the ReLU nonlinearity in certain layers.
    • The local contrast normalization resembles [11], but it:
      • Does not subtract the mean activity
      • Reduces top-1 and top-5 error rates by 1.4% and 1.2%.
      • Verified on CIFAR-10: a four-layer CNN achieved test error rates of:
        • 13% without normalization
        • 11% with normalization
  • Overlapping Pooling:
    • Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.
    • Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]).
    • The pooling used throughout the proposed network is overlapping.
    • Overlapping pooling reduces the top-1 and top-5 error rates by 0.4% and 0.3% as compared to non-overlapping pooling.
    • The paper observed that models with overlapping pooling are slightly harder to overfit.
  • Overall Architecture:
    • Eight learned layers — five convolutional and three fully-connected.
      • Five convolutional layers, some of which are followed by max-pooling layers,
      • Three fully-connected layers with a final 1000-way softmax.
    • Training maximizes the multinomial logistic regression objective.
    • Has 60 million parameters and 650,000 neurons
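The local response normalization discussed above can be sketched as below. The hyperparameter values (n = 5, k = 2, α = 1e-4, β = 0.75) are the ones the paper reports, but this is an illustrative reimplementation, not the authors' code:

```python
import numpy as np

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across channels.
    a: activations of shape (channels, height, width); each channel is
    divided by a term summing squared activations of n adjacent channels."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

Because k = 2, the denominator always exceeds 1, so the normalized responses are damped relative to the raw activations.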

Reducing Overfitting:

  • Data Augmentation:
    • The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]).
    • The transformation criteria:
      • Allow very little computation
      • The transformed images do not need to be stored on disk.
      • Implementation in the paper: the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images.
    • Two distinct data augmentation techniques:
      • Generating image translations and horizontal reflections.
      • Altering the intensities of the RGB channels in training images.
  • Dropout:
    • Dropout [10] sets the output of each hidden neuron to zero with probability 0.5.
    • Dropped-out neurons contribute neither to the forward pass nor to backpropagation.
    • So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
    • Advantages:
      • Reduces complex co-adaptations of neurons,
      • Forced to learn more robust features
      • Reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.
    • Dropout roughly doubles the number of iterations required to converge.
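A minimal sketch of the dropout scheme described above: training-time masking with p = 0.5, and test-time scaling by (1 − p) as the paper's approximation to averaging the exponentially many sub-networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """At training time, zero each hidden activation with probability p,
    sampling a different sub-architecture per input (weights are shared).
    At test time, use all units but scale outputs by (1 - p), which
    approximates the geometric mean of the sampled networks' predictions."""
    if train:
        return h * (rng.random(h.shape) >= p)
    return h * (1.0 - p)
```

The shared-weight ensemble view is why no extra parameters are needed despite the exponentially many architectures sampled.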


  • In the ILSVRC-2012 competition, a variant of this model won with a top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
  • On the test data, the proposed method achieved top-1 and top-5 error rates of 37.5% and 17.0%, which are considerably better than the previous state of the art.
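The top-1 and top-5 error rates quoted here can be computed as sketched below (an illustrative helper, not code from the paper):

```python
import numpy as np

def top_k_error(probs, labels, k=5):
    """Fraction of examples whose true label is absent from the model's
    k highest-probability classes (ImageNet's top-k error rate).
    probs: (examples, classes); labels: (examples,)."""
    topk = np.argsort(probs, axis=1)[:, -k:]      # indices of k largest
    hits = (topk == labels[:, None]).any(axis=1)  # true label among them?
    return 1.0 - hits.mean()
```

Top-1 is the usual classification error; top-5 forgives the model when the correct class is merely among its five most probable guesses.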

My Comments

  • This paper clarifies the need for big datasets to obtain good results on image classification. The bigger the dataset, the more parameters are needed to capture its large variation; a system with more parameters, in turn, is more prone to overfitting. Techniques to address this overfitting problem, such as dropout and data augmentation, are discussed.
  • In our attempts to apply deep CNNs in our own studies, we found that choosing appropriate hyperparameters, such as the learning rate, number of epochs, and batch size, is not easy. I hope there will be a dedicated paper that states guidelines for finding these parameters quickly.

Review on a Deep Learning that Predicts How We Pose from Motion

Posted by Mohamad Ivan Fanany


This writing summarizes and reviews a deep learning framework that predicts how we pose using motion features: MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation.


  • Human body pose recognition in video is a long-standing problem in computer vision with a wide range of applications.
  • Rather than motion-based features, computer vision approaches tend to rely on appearance cues:
    • Texture patches,
    • Edges,
    • Color histograms,
    • Foreground silhouettes,
    • Hand-crafted local features (such as histogram of gradients (HoG) [2])
  • Psychophysical experiments [3] have shown that motion is a powerful visual cue that alone can be used to extract high-level information, including articulated pose.


  • A convolutional neural network for articulated human pose estimation in videos, which incorporates both color and motion features.
  • Significantly better performance than current state-of-the-art pose detection systems.
  • Successfully incorporates motion-features to enhance the performance of pose-detection ‘in-the-wild’.
  • Achieves close to real-time frame rates, making it suitable for a wide variety of applications.
  • A new dataset called FLIC-motion: the FLIC dataset [1] augmented with ‘motion-features’ for each of the 5003 images collected from Hollywood movies.


  • Body pose recognition remains a challenging problem due to:
    • High dimensionality of the input data
    • High variability of possible body poses.

Previous Works

  • Previous work [4, 5]:
    • Motion features have had little or no impact on pose inference.
    • Adding high-order temporal connectivity to traditional models would most often lead to intractable inference.
  • The proposed paper shows:
    • Deep learning is able to successfully incorporate motion features.
    • Deep learning is able to out-perform existing state-of-the-art techniques.
    • Using motion features alone, the proposed method outperforms [6, 7, 8].
    • These results further strengthen the claim that information coded in motion features is valuable and should be used when available.

Geometric Model Based Tracking:

  • Articulated tracking systems (1983 – 2007):
    • The earliest (in 1983): Hogg [9], using edge features and a simple cylinder-based body model.
    • More recent (1995 – 2001) [10,11, 12,13, 14, 15,16]:
      • The models used in these systems were explicit 2D or 3D jointed geometric models.
      • Most systems had to be hand-initialized (except [12])
      • Focused on incrementally updating pose parameters from one frame to the next.
    • More recent (2007 – 2010):
      • More complex examples come from the HumanEva dataset competitions [17]
      • Use video or higher-resolution shape models such as SCAPE [18] and extensions.
    • Complete survey of this era [19].
  • Most recently (2008 – 2011), techniques to create very high-resolution animations of detailed body and cloth deformations [20, 21, 22].
  • Key difference of the proposed approach: dealing with single view videos in unconstrained environments.

Statistical Based Recognition:

  • No explicit geometric model:
  • The earliest (in 1995) [23] used oriented angle histograms to recognize hand configurations.
    • This was the precursor for
      • The bag-of-features,
      • SIFT [24],
      • STIP [25],
      • HoG, and Histogram of Flow (HoF) [26]
      • Dalal and Triggs in 2005 [27].
  • Shape-context edge-based histograms from the human body [28, 29]
  • Shape-context from silhouette features [30].
  • Learn a parameter sensitive hash function to perform example-based pose estimation [31].
  • Extract, learn, or reason over entire body features, using a combination of local detectors and structural reasoning:
    • Coarse tracking [32]
    • Person-dependent tracking [33]
  • “Pictorial Structures” [34]
  • Matching pictorial structures efficiently to images using ‘Deformable Part Models’ (DPM) in [35] in 2008.
  • Many algorithms use DPM for creating the body part unary distribution [36, 6, 7, 37] with spatial-models incorporating body-part relationship priors.
  • A cascade of body part detectors to obtain more discriminative templates [38].
  • Almost all of the best-performing algorithms since have built solely on HoG and DPM for local evidence, with more sophisticated spatial models.
  • Pishchulin [39] proposes a model that augments the DPM unaries with Poselet conditioned [40] priors.
  • Sapp and Taskar [1] propose a model where they cluster images in the pose-space and then find the mode which best describes the input image.
  • The pose of this mode then acts as a strong spatial prior, whereas the local evidence is again based on HoG and gradient features.
  • Poselets approach [40]
  • The Armlets approach [41]:
    • Incorporates edges, contours, and color histograms in addition to the HoG features.
    • Employ a semi-global classifier for part configuration
    • Show good performance on real-world data.
    • They only show their results on arms.
  • The major drawback of all these approaches is that both the local evidence and the global structure are hand-crafted.
  • Key difference of the proposed method: Jointly learn both the local features and the global structure using a multi-resolution convolutional network.
  • An ensemble of random trees to perform per-pixel labeling of body parts in depth images [42].
    • To reduce overall system latency and avoid repeated false detections, they focus on pose inference using only a single depth image.
  • The proposed approach:
    • Extend the single-frame requirement to at least 2 frames (which considerably improves pose inference)
    • The input is unconstrained RGB images rather than depth.

Pose Detection Using Image Sequences:

Deep Learning based Techniques:

  • State-of-the-art performance on many vision tasks using deep learning [43, 44, 45, 46, 47, 48].
  • [49, 50, 51] also apply neural networks for pose recognition.
  • Toshev et al. [49] show better than state-of-the-art performance on the ‘FLIC’ and ‘LSP’ [52] datasets.
  • In contrast to Toshev et al., the proposed work introduces a translation-invariant model which improves upon the previous method, especially in the high-precision region.

Body-Part Detection Model

  • The paper proposes a Convolutional Network (ConvNet) architecture for estimating the 2D location of human joints in video.
    • The input to the network is an RGB image and a set of motion features.
    • Investigate a wide variety of motion feature formulations.
    • Introduce a simple Spatial-Model to solve a specific sub-problem associated with evaluation of our model on the FLIC-motion dataset.

Motion Features

  • Aims for:
    • The true motion-field: the perspective projection of the 3D velocity-field of moving surfaces
    • Incorporate features that are representative of the true motion-field
    • Exploit motion as a cue for body part localization.
  • Evaluate and analyze four motion features which fall under two broad categories:
    • Using simple derivatives of the RGB video frames
    • Using optical flow features.
    • For each RGB image pair, the paper proposes the following features:
      • RGB Image pair
      • RGB image and an RGB difference image
      • Optical-flow vectors
      • Optical-flow magnitude
  • The RGB image pair:
    • The simplest way of incorporating the relative motion information between the two frames.
    • Suffers from a lot of redundancy (e.g., if there is no camera movement)
    • Extremely high dimensional.
    • Not obvious what changes in this high dimensional input space are relevant temporal information and what changes are due to noise or camera motion.
  • A simple modification to image-pair representation is to use a difference image:
    • Reformulates the RGB input so that the algorithm sees directly the pixel locations where high energy corresponds to motion
    • Alternatively the network would have to do this implicitly on the image pair.
  • A more sophisticated representation is optical-flow:
    • High-quality approximation of the true motion-field,
    • Optical flow would be nontrivial for the network to infer from the raw RGB input,
    • Perform optical-flow calculation as a pre-processing step (at the cost of greater computational complexity).
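Two of the feature formulations above, the RGB difference image and the optical-flow magnitude, might be computed as follows; the flow field itself is assumed to come from an external optical-flow routine run as pre-processing:

```python
import numpy as np

def motion_features(frame_t, frame_tp1, flow=None):
    """Difference image: high energy directly marks pixels where motion
    occurred, sparing the network from inferring this from the raw pair.
    If a precomputed flow field (H, W, 2) is supplied, also return its
    per-pixel magnitude."""
    diff = frame_tp1.astype(float) - frame_t.astype(float)
    if flow is None:
        return diff
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    return diff, mag
```

The difference image trades the redundancy of the raw pair for an explicit motion-energy map; flow magnitude goes further but at the pre-processing cost noted above.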

FLIC-motion dataset:

  • The paper proposes a new dataset called FLIC-motion.
  • It is comprised of:
    • The original FLIC dataset of 5003 labeled RGB images collected from 30 Hollywood movies,
    • 1016 images from the original FLIC are held out as a test set, augmented with the aforementioned motion features.
  • Experiments with several lengths of frame separation between the image pair.
  • One image of each pair is warped using the inverse of the best-fitting projection between the pair, to remove camera motion.

Convolutional neural network:

  • Recent work [49, 50] has shown that ConvNet architectures are well suited for the task of human body pose detection.
  • Due to the availability of modern Graphics Processing Units (GPUs), we can perform Forward Propagation (FPROP) of deep ConvNet architectures at interactive frame-rates.
  • Similarly, we can realize pose detection model as a deep ConvNet architecture.
    • Input: a 3D tensor containing an RGB image and its corresponding motion features.
    • Output: a 3D tensor containing response-maps, with one response-map for each joint.
    • Each response-map describes the per-pixel energy for the presence of the corresponding joint at that pixel location.
    • Based on a sliding-window architecture.
    • The input patches are first normalized using:
      • Local Contrast Normalization (LCN [53]) for the RGB channels
      • A new normalization for motion features that is called Local Motion Normalization (LMN)
        • Local subtraction with the response from a Gaussian kernel with large standard deviation followed by a divisive normalization.
        • It removes some unwanted background camera motion as well as normalizing the local intensity of motion
        • Helps improve network generalization for motions of varying velocity but with similar pose.
    • Prior to processing through the convolution stages, the normalized motion channels are concatenated along the feature dimension with the normalized RGB channels.
    • The resulting tensor is processed through 3 stages of convolution:
      • Rectified linear units (ReLU)
      • Maxpooling
      • A single ReLU layer.
    • The output of the last convolution stage is then passed to a three stage fully-connected neural network.
    • The network is then applied to all 64 × 64 sub-windows of the image, stepped every 4 pixels horizontally and vertically to produce a dense response-map output, one for each joint.
    • The major advantage: the learned detector is translation invariant by construction.
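The LMN step described above might look roughly like this. The Gaussian's standard deviation, kernel extent, and the exact form of the divisive step are assumptions, since the summary describes the operation only qualitatively:

```python
import numpy as np

def lmn(x, sigma=8.0, eps=1e-5):
    """Hypothetical sketch of Local Motion Normalization: subtract the
    response of a large-standard-deviation Gaussian (removing slowly varying
    background/camera motion), then divisively normalize the result."""
    r = int(3 * sigma)
    k = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    k /= k.sum()
    # separable Gaussian blur: two passes of 1-D convolution
    blur = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 0, x)
    blur = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 1, blur)
    centered = x - blur
    # divisive normalization (global std here; the paper's exact local
    # normalization window is not specified in this summary)
    return centered / (centered.std() + eps)
```

The subtraction suppresses broad, slow intensity changes, while the division makes motions of different velocities but similar pose produce comparably scaled features.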

Simple Spatial Model

  • The test images in FLIC-motion may contain multiple people; however, only a single actor per frame is labeled in the test set.
  • A rough torso location of the labeled person is provided at test time to help locate the “correct” person.
  • Incorporate the rough torso location information by means of a simple and efficient Spatial-Model.
  • The inclusion of this stage has two major advantages:
    • The correct feature activation from the Part-Detector output is selected for the person for whom a ground-truth label was annotated.
    • Since the joint locations of each part are constrained in proximity to the single ground-truth torso location, the connectivity between joints is also (indirectly) constrained, enforcing that inferred poses are anatomically viable.


  • Training time for the model on the FLIC-motion dataset (3957 training set images, 1016 test set images) is approximately 12 hours, and FPROP of a single image takes approximately 50 ms (on a 12-core workstation with an NVIDIA Titan GPU).
  • For the proposed models that use optical flow as a motion feature input, the most expensive part of our pipeline is the optical flow calculation, which takes approximately 1.89s per image pair.
  • Plan to investigate real-time flow estimations in the future.

Comparison with Other Techniques

  • Compares the performance of our system with other state-of-the-art models on the FLIC dataset for the elbow and wrist joints:
    • The proposed detector is able to significantly outperform all prior techniques on this challenging dataset. Note that using only motion features already outperforms [6, 7, 8].
    • Using only motion features is less accurate than using a combination of motion features and RGB images, especially in the high accuracy region. This is because fine details such as eyes and noses are missing in motion features.
    • Toshev et al. [49] suffers from inaccuracy in the high-precision region, which we attribute to inefficient direct regression of pose vectors from images.
    • MODEC [1], Eichner et al. [6] and Sapp et al. [8] build on hand-crafted HoG features. They all suffer from the limitations of HoG (e.g., they all discard color information).
    • Jain et al. [50] do not use multi-scale information and evaluate their model in a sliding window fashion, whereas we use the ‘one-shot’ approach.

My Review

  • This paper provides a comprehensive and systematic survey of the literature on human pose estimation.
  • The new idea is the use of motion features for pose estimation; embedded alongside appearance features, they deliver the current best performance.
  • The estimated pose is the 2D location of human joints.
  • Some questions come up after reading the paper:
    • How can this be applied to 3D pose estimation?
    • How can this be integrated with 3D motion sensors such as the Kinect for game applications?

Review on A Paper that Combines Gabor Filter and Convolutional Neural Networks for Face Detection

Posted by Mohamad Ivan Fanany


This writing summarizes and reviews a paper that combines Gabor filters and convolutional neural networks: Face Detection Using Convolutional Neural Networks and Gabor Filters, for detecting facial regions in images of arbitrary size.


  • Detecting and locating human faces in an image or video has many applications:
    • human-computer interaction,
    • model based coding of video at very low bitrates
    • content-based video indexing.
  • Why Gabor filter?
    • Biologically motivated, since it models the response of human visual cortical cells [3],
    • Removes most of the variation in lighting and contrast,
    • Reduces intrapersonal variation,
    • Robust against small shifts and small object deformations,
    • Allows analysis of signals at different scales or resolutions,
    • Accommodates frequency and position simultaneously.
  • Why convolutional neural networks [6]?
    • Incorporates prior knowledge about the input signal and its distortions into its architecture.
    • Specifically designed to cope with the variability of 2D shapes to be recognized.
    • Combine local feature fields and shared weights
    • Utilize spatial subsampling to ensure some level of shift, scale and deformation invariance.
    • Using the local receptive fields, the neurons can extract simple visual features such as corners and end-points. These elementary features are then linked by the succeeding layers to detect more complicated features.


  • Build a method for detecting facial regions in images of arbitrary size.

Key Ideas:

  • Combining a Gabor filter and a convolutional neural network.
  • Uses Gabor filter-based features instead of the raw gray values as the input for a convolutional neural network.


  1. Apply the Gabor filter, which extracts intrinsic facial features. This transformation yields four subimages.
  2. Apply the convolutional neural network to the four subimages obtained.


  • In complex scenes, human faces may appear in different
    • scales,
    • orientations,
    • head poses.
  • Human face appearance could change considerably due to change of
    • lighting condition,
    • facial expressions,
    • shadows,
    • presence of glasses.

Previous Works:

  • Facial regions detection:
    • Support vector machines [10],
    • Bayesian classifiers [10],
    • Neural networks [7][5].
  • Face knowledge-based detector [7]
  • Finding frontal faces [8]
  • Gabor filter-based features for face recognition [10] but no proposed method for face detection.
  • Face detection in static images using CNNs [5][2]

Features Extraction:

  • Two different orientations and two different wavelengths are utilized.
  • Different facial features are selected, depending on the response of each filter.
  • In frontal or near frontal face image the eyes and mouth are oriented horizontally, while the nose constitutes vertical orientation.
  • The Gabor wavelet is capable of selecting localized variations in image intensity.
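A Gabor filter bank with two orientations and two wavelengths, as described above, might be built like this (the specific σ, kernel size, and wavelength values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def gabor_kernel(theta, wavelength, sigma=4.0, size=21):
    """Real part of a Gabor filter: a Gaussian envelope times a cosine
    carrier at orientation theta (radians) and the given wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

# two orientations x two wavelengths -> four filtered subimages; horizontal
# filters respond to eyes/mouth, vertical ones to the nose
bank = [gabor_kernel(t, w) for t in (0.0, np.pi / 2) for w in (4.0, 8.0)]
```

Convolving the input with each kernel in `bank` yields the four subimages that are fed to the convolutional network instead of raw gray values.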

Convolutional Neural Algorithm:

  • Contains a set of layers each of which consists of one or more planes.
  • Each unit in the plane is connected to a local neighborhood in the previous layer.
  • The unit can be seen as a local feature detector whose activation characteristic is determined in the learning stage.
  • The outputs of such a set of units constitute a feature map.
  • Units in a feature map are constrained to perform the same operation on different parts of the input image or previous feature maps, extracting different features from the same image.
  • A feature map can be obtained in a sequential manner through scanning the input image.
  • The scanning operation is equivalent to a convolution with a small kernel.
  • The feature map can be treated as a plane of units that share weights.
  • The subsampling layers introduce a certain level of invariance to distortions and translations.
  • Features detected by the units in the successive layers are:
    • decreasing spatial resolution
    • increasing complexity
    • increasing globality.
  • The network is trained in a supervised manner using the back-propagation algorithm, which has been adapted for convolutional neural networks.
  • The partial derivatives of the activation function with respect to each connection have been computed, as if the network were a typical multi-layer one.
  • The partial derivatives of all the connections that share the same parameter have been added to construct the derivative with respect to that parameter.
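The scanning/weight-sharing operation described above amounts to a convolution with a small shared kernel followed by the network's tanh activation; a minimal sketch:

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """One feature map: scan the image with a single shared kernel (a valid
    convolution), add a shared bias, and squash with the hyperbolic tangent
    activation used in the paper. Every output unit applies the same weights
    at a different location, which is exactly weight sharing."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return np.tanh(out)

# a 20x20 scanning window convolved with a 5x5 mask yields a 16x16 map,
# matching the C1 sizes reported in the architecture below
fmap = feature_map(np.zeros((20, 20)), np.ones((5, 5)))
```

During back-propagation, the gradient for each shared weight is the sum of the per-location partial derivatives, which is the adaptation described in the two bullets above.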

Convolutional Neural Architecture:

  • 6 layers.
  • Layer C1 (convolutional layer 1):
    • Performs a convolution on the Gabor filtered images using an adaptive mask.
    • The weights in the convolution mask are shared by all the neurons of the same feature map.
    • The receptive fields of neighboring units overlap.
    • The size of the scanning windows was chosen to be 20×20 pixels.
    • The size of the mask is 5×5
    • The size of the feature map of this layer is 16×16.
    • The layer has 104 trainable parameters.
  • Layer S1 (subsampling layer 1):
    • The averaging/subsampling layer.
    • Partially connected to C2.
    • The task is to discover the relationships between different features.
  • Layer C2 (convolutional layer 2):
    • Composed of 14 feature maps.
    • Each unit contains one or two receptive fields of size 3×3 which operate at identical positions within each S1 map.
    • The first eight feature maps use single receptive fields.
    • Form two independent groups of units responsible for distinguishing between face and nonface patterns.
    • The remaining six feature maps take inputs from every contiguous subsets of two feature maps in S1. This layer has 140 free parameters.
  • Layer S2 (subsampling layer 2):
    • Consists of 4 planes of size 16 by 16.
    • Each unit in one of these planes receives four inputs from the corresponding plane in C1.
    • Receptive fields do not overlap and all the weights are equal within a single unit.
    • Therefore, this layer performs a local averaging and 2 to 1 subsampling.
    • The number of trainable parameters utilized in this layer is 8.
    • Once a feature has been extracted through the first two layers its accurate location in the image is less substantial and spatial relations with other features are more relevant.
    • Plays the same role as the layer S1.
    • It is constructed of 14 feature maps and has 28 free parameters.
    • In the next layer each of 14 units is connected only to the corresponding feature map of the S2 layer. It has 140 free parameters.
    • The output layer has one node that is fully connected to all the nodes from the previous layer.
    • The network contains many connections but relatively few free trainable parameters due to weight sharing.
    • Weight sharing:
      • Considerably reduces the number of free parameters.
      • Improves the generalization capability.
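To make the parameter bookkeeping above concrete, here is a toy numpy sketch of a C1-style convolution with a shared 5×5 mask followed by 2-to-1 averaging subsampling. The tanh activation and random weights are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def convolve_valid(image, mask, bias):
    """Valid 2-D convolution with a shared 5x5 mask: every unit in the
    resulting feature map reuses the same 26 parameters (25 weights + 1 bias)."""
    h, w = image.shape
    mh, mw = mask.shape
    out = np.empty((h - mh + 1, w - mw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.tanh(np.sum(image[i:i+mh, j:j+mw] * mask) + bias)
    return out

def subsample_2to1(fmap, weight, bias):
    """Local averaging over non-overlapping 2x2 regions followed by a single
    trainable scale and bias (2 parameters per plane)."""
    h, w = fmap.shape
    pooled = fmap.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.tanh(weight * pooled + bias)

rng = np.random.default_rng(0)
window = rng.standard_normal((20, 20))        # 20x20 scanning window
masks = rng.standard_normal((4, 5, 5)) * 0.1  # 4 feature maps, shared 5x5 masks
c1 = [convolve_valid(window, m, 0.0) for m in masks]
print(c1[0].shape)              # (16, 16) feature map
print(4 * (5 * 5 + 1))          # 104 trainable parameters in C1
s = subsample_2to1(c1[0], 1.0, 0.0)
print(s.shape)                  # (8, 8) after 2-to-1 subsampling
```

With four maps, weight sharing gives only 4 × (25 + 1) = 104 free parameters in C1, even though the layer has 4 × 16 × 16 × 26 connections.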


  • The recognition performance is dependent on the size and quality of the training set.
  • The face detector was trained on 3000 non-face patches collected from about 1500 images and on 1500 faces covering out-of-plane rotations in the range of −20° to +20°.
  • All faces were manually aligned by eyes position.
  • For each face example, synthesized faces were generated by random in-plane rotation in the range of −10° to +10°, random scaling of about ±10%, random shifting of up to ±1 pixel, and mirroring.
  • All faces were cropped and rescaled to windows of size 20×20 pixels while preserving their aspect ratio.
  • Such a window size is considered in the literature as the minimal resolution that can be used without losing critical information from the face pattern.
  • The training collection also contains images acquired from our video cameras.
  • Most of the training images, which were obtained from the WWW, are of very good quality; the images obtained from cameras are of lower quality.
  • To provide more false examples, training was performed with bootstrapping [7].
  • By using bootstrapping we iteratively gathered examples which were close to the boundaries of face and non-face clusters in the early stages of training.
  • The activation function in the network was a hyperbolic tangent.
  • Training the face detector took around 60 hours on a 2.4 GHz Pentium IV-based PC.
  • There was no overlap between the training and test images.

Testing Experiments:

  • Camera sensor: binocular Megapixel Stereo Head (for testing).
  • A skin color detector is the first classifier in our system [4].
  • To find the faces the detector moves a scanning subwindow by a pre-determined number of pixels within only skin-like regions.
  • The output of the face detector is then utilized to initialize our face/head tracker [4].
  • The detector operates on images of size 320×240 and can process 2-5 images per second depending on the image structure.
  • To estimate the recognition performance, the paper used only static gray images.


  • Test dataset containing 1000 face samples and 10000 non-face samples.
  • Obtained detection rate is 87.5%.
  • Using only the convolutional network, the detection rate is only 79%.
  • The network structure is kept relatively simple so that face detection runs in real time on the available computational resources.
  • It is much easier to train a convolutional neural network on Gabor-filtered input images than on raw or histogram-equalized images.


  • The experimental results are promising in both detection rates and processing speed.
  • The Gabor filter provides effective features for a convolutional neural network.
  • Much better recognition performance than using the convolutional neural network alone.
  • Achieves high face detection rates and real-time performance because it avoids exhaustive search over the whole image.
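As an illustration of the kind of Gabor front end the paper builds on, here is a minimal numpy sketch of a small Gabor filter bank applied to a 20×20 window. The kernel size, wavelength, and orientations are illustrative choices, not the paper's settings:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a 2-D Gabor filter: a cosine carrier at orientation
    theta modulated by an elliptical Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength)

def filter_valid(image, kernel):
    """'Valid' 2-D correlation implemented with a sliding window view."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

img = np.random.default_rng(1).standard_normal((20, 20))  # a 20x20 face window
thetas = [np.pi * k / 4 for k in range(4)]                # 4 orientations
responses = [filter_valid(img, gabor_kernel(7, 4.0, t, 2.0)) for t in thetas]
print(len(responses), responses[0].shape)  # 4 (14, 14)
```

The orientation-selective response maps, rather than the raw pixels, would then be the CNN's input.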

My review:

  • Even though it is quite old (ICANN 2005), the paper is interesting because it is the first paper to combine Gabor wavelet filters as input to convolutional neural networks.
  • The paper states that the recognition result with Gabor-filtered input is much better than with convolutional neural networks alone.
  • However, Stefan Duffner's dissertation (2007), on page 120, states the following:

Our experimental results show that using the image intensities as input of the CNN yields to the best results compared to gradient images and Gabor wavelet filter responses.

  • Thus, the results seem to contradict the findings of Duffner's dissertation.
  • It is natural that we come up with more questions on this:
    • What is the effect of using a stereo camera during testing? What if a monocular camera is used?
    • Could the Gabor filter cause overfitting?
  • The overhead of computing the Gabor filters for each training and testing image seems to hinder the wide use of this system. One breakthrough may be achieved if we can compute the Gabor filters directly inside the CNN, in line with the training process. We can hope that such a mechanism will also maintain the good results of the system.

Review on Deep Learning for Signal Processing

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews Deep Learning and Its Applications to Signal and Information Processing


  • Signal processing research has significantly widened its scope [4].
  • Machine learning has been an important technical area of signal processing.
  • Since 2006, deep learning—a new area of machine learning research—has emerged [7], impacting a wide range of signal and information processing.


  • Introduce the emerging technologies enabled by deep learning.
  • Review the research in deep learning that is relevant to signal processing.
  • Point out the future research directions.
  • Provide a brief survey of deep learning applications in three main categories:
    1. Speech and audio
    2. Image and video
    3. Language processing and information retrieval

Introduction to Deep Learning:

  • Traditional machine learning and signal processing exploit shallow architectures (containing a single layer of nonlinear feature transformation) such as:
    • Hidden Markov models (HMMs),
    • Linear or nonlinear dynamical systems,
    • Conditional random fields (CRFs),
    • Maximum entropy (MaxEnt) models,
    • Support vector machines (SVMs),
    • Kernel regression,
    • Multilayer perceptron (MLP) with a single hidden layer.
  • The SVM is a shallow linear separation model with one feature transformation layer when the kernel trick is used, and with zero feature transformation layers when it is not.
  • Human information processing mechanisms (e.g., vision and speech) need deep architectures for extracting complex structure and building internal representations from rich sensory inputs (e.g., natural images and their motion, speech, and music).
  • Human speech production and perception systems are layered hierarchical structures that transform information from the waveform level to the linguistic level and vice versa.
  • Processing of human information media signals will keep advancing if efficient and effective deep learning algorithms are developed.
  • Signal processing systems with deep architectures are composed of many layers of nonlinear processing stages, where each lower layer’s outputs are fed to its immediate higher layer as the input.
  • Two key properties of successful deep learning techniques:
    • The generative nature of the model, which requires an additional top layer to perform the discriminative task
    • Unsupervised pretraining that effectively uses large amounts of unlabeled training data for extracting structures and regularities in the input features.

Brief history:

  • The concept of deep learning originated from artificial neural network research.
  • Multilayer perceptron with many hidden layers is a good example of deep architectures.
  • Backpropagation is a well-known algorithm for learning the weights of multilayer perceptron.
  • Backpropagation alone does not work well with more than a small number of hidden layers (see a review and analysis in [1]).
  • The pervasive presence of local optima in the nonconvex objective function of the deep networks is the main source of difficulty in learning.
  • Backpropagation is based on local gradient descent and starts usually at some random initial points.
  • Backpropagation often gets trapped in local optima and the severity increases significantly as the depth increases.
  • Due to local optima problem, many machine learning and signal processing research steered away from neural networks to shallow models that have convex loss functions (e.g., SVMs, CRFs, and MaxEnt models) for which global optimum can be efficiently obtained at the cost of less powerful models.
  • An unsupervised learning algorithm, which efficiently alleviates local minima problem, was introduced in 2006 by Hinton et al. [7] for a class of deep generative models that is called deep belief networks (DBNs).
  • A core component of the DBN is a greedy, layer-by-layer learning algorithm that optimizes DBN weights at time complexity linear to the size and depth of the networks.
  • Separately and with some surprise, initializing the weights of an MLP with a correspondingly configured DBN often produces much better results than initializing with random weights [1], [5].
  • Deep networks that are learned with unsupervised DBN pretraining followed by the backpropagation fine-tuning are also called DBNs (e.g., [8] and [9]).
  • DBN attractive properties:
    1. Makes effective use of unlabeled data;
    2. Can be interpreted as Bayesian probabilistic generative models;
    3. Hidden variables in the deepest layer are efficient to compute;
    4. The overfitting problem (often observed in models with millions of parameters such as DBNs) and the underfitting problem (often encountered in deep networks) are both effectively addressed by the generative pretraining step.
  • Since the publication of the seminal work of [7], numerous researchers have been improving and applying the deep learning techniques with success.
  • Another popular technique is to pretrain the deep networks layer by layer by considering each pair of layers as a denoising auto-encoder [1].
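That greedy layer-by-layer scheme can be sketched in numpy as follows. The toy sizes, zero-masking noise, tied weights, and plain gradient descent are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_dae_layer(data, n_hidden, noise=0.3, lr=0.5, epochs=100):
    """Pretrain one layer as a denoising auto-encoder with tied weights:
    corrupt the input, encode, decode, and descend the squared
    reconstruction error against the *clean* input."""
    n_vis = data.shape[1]
    W = rng.standard_normal((n_vis, n_hidden)) * 0.01
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
    errors = []
    for _ in range(epochs):
        corrupted = data * (rng.random(data.shape) > noise)  # zero-masking noise
        h = sigmoid(corrupted @ W + b_h)
        recon = sigmoid(h @ W.T + b_v)
        errors.append(((recon - data) ** 2).mean())
        d_recon = (recon - data) * recon * (1 - recon)       # output delta
        d_h = (d_recon @ W) * h * (1 - h)                    # hidden delta
        W -= lr * (d_recon.T @ h + corrupted.T @ d_h) / len(data)
        b_v -= lr * d_recon.mean(axis=0)
        b_h -= lr * d_h.mean(axis=0)
    return W, b_h, b_v, errors

data = (rng.random((200, 16)) < 0.3).astype(float)   # toy binary "patches"
W1, b1, _, errs1 = pretrain_dae_layer(data, 10)
h1 = sigmoid(data @ W1 + b1)                 # layer-1 codes
W2, b2, _, errs2 = pretrain_dae_layer(h1, 6) # greedy: next layer trained on codes
print(errs1[0] > errs1[-1])  # True: reconstruction error fell during pretraining
```

After pretraining, the stacked weights would initialize a deep network that is fine-tuned with backpropagation.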

Applications of Deep Learning to Signal Processing:

  • The technical scope of signal processing has expanded from traditional signal types (audio, speech, image, and video) to also include text, language, and documents that convey high-level, semantic information for human consumption.
  • The scope of processing has been extended from the conventional coding, enhancement, analysis, and recognition to include more human-centric tasks of interpretation, understanding, retrieval, mining, and user interface [4].
  • The signal processing areas can be defined by a matrix constructed with the two axes of “signal” and “processing”.
  • The deep learning techniques have recently been applied to quite a number of extended signal processing areas.

Speech and audio:

  • The traditional MLP has been in use for speech recognition for many years.
  • Used alone, MLP performance is typically lower than the state-of-the-art HMM systems with observation probabilities approximated with Gaussian mixture models (GMMs).
  • Deep learning techniques were successfully applied to phone recognition [8], [9] and large vocabulary continuous speech recognition (LVCSR) by integrating the powerful discriminative training ability of DBNs with the sequential modeling ability of HMMs.
  • Such a model is typically named DBN-HMM, where the observation probability is estimated using the DBN and the sequential information is modeled using the HMM.
  • In [9], a five-layer DBN was used to replace the Gaussian mixture component of the GMM-HMM, and the monophone state was used as the modeling unit.
  • Although the monophone model was used, the DBN-HMM approach achieved phone recognition accuracy competitive with state-of-the-art triphone GMM-HMM systems.
  • The DBN-CRF in [8] improved on the DBN-HMM of [9] by using a CRF instead of an HMM to model the sequential information and by applying the maximum mutual information (MMI) criterion in training for speech recognition.
  • The sequential discriminative learning technique developed in [8] jointly optimizes the DBN weights, transition weights, and phone language model, and achieved higher accuracy than the DBN-HMM phone recognizer with the frame-discriminative training criterion implicit in the DBN's fine-tuning procedure implemented in [9].
  • The DBN-HMM can be extended from the context-independent model to the context-dependent model and from the phone recognition to the LVCSR.
  • Experiments on the challenging Bing mobile voice search data set collected under the real usage scenario demonstrate that the context-dependent DBN-HMM significantly outperforms the state-of-the-art HMM system.
  • Three factors contribute to the success of context-dependent DBN-HMM:
    • Triphone senones as the DBN modeling units,
    • Triphone GMM-HMM to generate the senone alignment,
    • the tuning of the transition probabilities.
  • Experiments indicate that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM.
  • In [5], the deep auto-encoder [7] is explored for speech feature coding with the goal to compress the data to a predefined number of bits with minimal reproduction error.
  • DBN pretraining is found to be crucial for high coding efficiency.
  • When DBN pretraining is used, the deep auto-encoder is shown to significantly outperform a traditional vector quantization technique.
  • If weights in the deep auto-encoder are randomly initialized, the performance is substantially degraded.
  • Another popular deep model: convolutional DBN
  • Applications of the convolutional DBN to audio and speech data show strong results for music artist and genre classification, speaker identification, speaker gender classification, and phone classification.
  • Deep-structured CRFs, which stack many layers of CRFs, have been successfully used in the speech-related tasks of language identification, phone recognition, sequential labeling [15], and confidence calibration.
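The DBN-HMM hybrid described above (DBN posteriors divided by state priors to obtain scaled likelihoods, which the HMM then decodes) can be illustrated with a toy Viterbi pass; all probabilities below are invented for illustration:

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Standard Viterbi decoding over log 'emission' scores log_b[t, s]."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (from, to)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical DBN posteriors p(s | x_t) for 3 states over 5 frames,
# divided by state priors p(s) to obtain scaled likelihoods for the HMM.
posteriors = np.array([[0.70, 0.20, 0.10],
                       [0.60, 0.30, 0.10],
                       [0.20, 0.60, 0.20],
                       [0.10, 0.70, 0.20],
                       [0.05, 0.15, 0.80]])
priors = np.array([0.40, 0.35, 0.25])
log_b = np.log(posteriors / priors)              # scaled likelihoods
log_A = np.log(np.array([[0.80, 0.15, 0.05],     # mostly self-transitions
                         [0.05, 0.80, 0.15],
                         [0.05, 0.15, 0.80]]))
log_pi = np.log(np.array([0.8, 0.1, 0.1]))
print(viterbi(log_b, log_A, log_pi))             # [0, 0, 1, 1, 2]
```

Replacing the GMM likelihoods with these DBN-derived scores is the core of the hybrid; the HMM transition structure is unchanged.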

Image and video:

  • The original DBN and deep auto-encoder (AE) were developed and first succeeded on simple image recognition and dimensionality reduction (coding) tasks (MNIST) in [7].
  • Interesting finding: the gain of coding efficiency of DBN-based auto-encoder (on the image data) over the conventional method of principal component analysis as demonstrated in [7] is very similar to the gain reported in [5] on the speech data over the traditional technique of vector quantization.
  • In [10], Nair and Hinton developed a modified DBN where the top-layer uses a third-order Boltzmann machine.
    • Apply the modified DBN to the NORB database—a three-dimensional object recognition task.
    • Report an error rate close to the best published result on this task.
    • DBN substantially outperforms shallow models such as SVMs.
  • Tang and Eliasmith developed two strategies to improve the robustness of the DBN in [14].
    1. Use sparse connections in the first layer of the DBN as a way to regularize the model.
    2. Developed a probabilistic denoising algorithm. Both techniques are shown to be effective in improving the robustness against occlusion and random noise in a noisy image recognition task.
  • Image recognition with a more general approach than the DBN appears in [11].
  • DBNs have also been successfully applied to create compact but meaningful representations of images for retrieval purposes.
  • On the large collection image retrieval task, deep learning approaches also produced strong results.
  • The use of conditional DBN for video sequence and human motion synthesis was reported in [13].
  • The conditional DBN makes the DBN weights associated with a fixed time window conditioned on the data from previous time steps.
  • Temporal DBN opens opportunity to improve the DBN-HMM towards efficient integration of temporal-centric human speech production mechanisms into DBN-based speech production models.

Language processing and information retrieval:

  • Research in language, document, and text processing has seen increasing popularity in signal processing research.
  • The society’s audio, speech, and language processing technical committee designated language, document, and text processing as one of its main focus areas.
  • Long history of using (shallow) neural networks in language modeling (LM)—an important component in speech recognition, machine translation, text information retrieval, and in natural language processing.
  • Recently, a DBN-HMM model was used for speech recognition. The observation probabilities are estimated using the DBN. The state values can be syllables, phones, subphones, monophone states, or triphone states and senones.
  • Temporally factored RBM has been used for LM. Unlike the traditional N-gram model, the factored RBM uses distributed representations not only for context words but also for the words being predicted. This approach can be directly generalized to deeper structures.
  • Collobert and Weston [2] developed and employed a convolutional DBN as the common model to simultaneously solve a number of classic problems including part-of-speech tagging, chunking, named entity tagging, semantic role identification, and similar word identification.
  • A similar multitask learning technique with DBN is used in [3] to attack the machine transliteration problem, which may be generalized to the more difficult problem of machine translation.
  • DBN and deep autoencoder are used for document indexing and retrieval [11], [12].
    • The hidden variables in the last layer are easy to infer.
    • Gives a much better representation of each document (based on the word-count features) than the widely used latent semantic analysis.
    • Using compact codes produced by deep networks, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses, to facilitate rapid document retrieval.
    • This idea is explored for audio document retrieval and speech recognition [5].
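The semantic-hashing idea, thresholding deepest-layer activations into compact binary addresses compared by Hamming distance, can be sketched as follows (the activations are invented for illustration):

```python
import numpy as np

def to_binary_code(top_activations, threshold=0.5):
    """Threshold deepest-layer activations into a compact binary code."""
    return (top_activations > threshold).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two codes (address distance)."""
    return int(np.count_nonzero(a != b))

# Hypothetical top-layer activations for three documents (8-bit codes);
# semantically similar documents should land on nearby addresses.
docs = np.array([[0.9, 0.8, 0.1, 0.2, 0.7, 0.1, 0.9, 0.3],
                 [0.8, 0.9, 0.2, 0.1, 0.6, 0.2, 0.8, 0.4],
                 [0.1, 0.2, 0.9, 0.8, 0.1, 0.9, 0.2, 0.7]])
codes = to_binary_code(docs)
print(hamming(codes[0], codes[1]))  # 0: the first two share an address
print(hamming(codes[0], codes[2]))  # 8: the third is maximally far away
```

Retrieval then reduces to probing nearby memory addresses instead of scanning the whole collection.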


  • Deep learning has already demonstrated promising results in many signal processing applications.

Future directions:

  • Better understanding the deep model and deep learning:
    • Why is learning in deep models difficult?
    • Why do the generative pretraining approaches seem to be effective empirically?
    • Is it possible to change the underlying probabilistic models to make the training easier?
    • Are there other more effective and theoretically sound approaches to learn deep models?
  • Better feature extraction models at each layer.
    • Without derivative and acceleration features in the DBN-HMM, the speech recognition accuracy is significantly reduced.
    • The current Gaussian-Bernoulli layer is not powerful enough to extract important discriminative information from the features.
    • Using a three-way associative model called the mcRBM, derivative and acceleration features are no longer needed to produce state-of-the-art recognition accuracy.
    • No reason to believe mcRBM is the best first-layer model for feature extraction either.
    • Theory needs to be developed to guide the search of proper feature extraction models at each layer.
  • More powerful discriminative optimization techniques.
    • Although current strategy of generative pretraining followed by discriminative fine-tuning seems to work well empirically for many tasks, it failed to work for some other tasks such as language identification.
    • The features extracted in the generative pretraining phase seem to describe the underlying speech variations well but do not contain enough information to distinguish between different languages.
    • A learning strategy that can extract discriminative features for language identification tasks is needed.
    • Extracting discriminative features may also greatly reduce the model size needed in the current deep learning systems.
  • Better deep architectures for modeling sequential data.
    • The existing approaches, such as DBN-HMM and DBN-CRF, represent simplistic and poor temporal models.
    • Models that can use DBNs in a more tightly integrated way and learning procedures that optimize the sequential criterion are important to further improve the performance of sequential classification tasks.
  • Adaptation techniques for deep models.
    • Many conventional models such as GMM-HMM have well-developed adaptation techniques that allow for these models to perform well under diverse and changing real-world environments.
    • Without effective adaptation techniques, deep techniques cannot outperform the conventional models when the test set is different from the training set, which is common in real applications.

My Review:

  • This is an introductory and easy read on the application of deep learning to the continuously expanding area of signal processing.
  • The deep learning coverage is slightly biased towards DBNs.
  • The convolutional DBN referred to is actually a convolutional neural network.
  • The future directions section is the most interesting part.

Review on A Deep Learning for Sleep Analysis

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews a paper on deep learning for sleep analysis: Sleep Stage Classification Using Unsupervised Feature Learning

Source code: Matlab code used in the paper is available at http://aass.oru.se/~mlt/


  • Multimodal sleep data is very complex.
  • Feature extraction of sleep data is difficult and time consuming.
  • The size of the feature space can grow large, which ultimately requires feature selection.
  • Unsupervised feature learning, and in particular deep learning [10, 11, 12, 13, 14, 15], provides ways of training the weight matrices in each layer in an unsupervised fashion as a pre-processing step before training the whole network.
  • Deep Learning has proven to give good results in other areas such as vision tasks [10], object recognition [16], motion capture data [17], speech recognition [18], and bacteria identification [19].


  • How to isolate features in multivariate time-series data that can be used to correctly identify sleep stages and automate the annotation process of generating sleep hypnograms.
  • The absence of universally applicable features for training a sleep stage classifier requires a two-stage process: feature extraction and feature selection [1, 2, 3, 4, 5, 6, 7, 8, 9].
  • Inconsistencies between sleep labs (equipment, electrode placement), experimental setups (number of signals and categories, subject variations), and interscorer variability (80% conformance for healthy patients and even less for patients with sleep disorder [9]) make it challenging to compare sleep stage classification accuracy to previous works.


  • The discovery of new useful feature representations that a human expert might not be aware of, which in turn could lead to a better understanding of the sleep process and present a way of exploiting massive amounts of unlabeled data.
  • Unsupervised feature learning not only removes the need for domain-specific expert knowledge but also inherently provides tools for anomaly detection and noise redundancy.

Addressed problem:

  • Build an unsupervised feature learning architecture which can eliminate the use of handmade features in sleep analysis.

Previous works:

  • The proposed architecture of training the DBN follows previous work with unsupervised feature learning for electroencephalography (EEG) event detection [20].
  • Results in [2] report a best result accuracy of around 61% for classification of 5 stages from a single EEG channel using GOHMM and AR coefficients as features.
  • Works by [8] achieved 83.7% accuracy using conditional random fields with six power spectra density features for one EEG signal on four human subjects during a 24-hour recording session and considering six stages.
  • Works by [7] achieved 85.6% accuracy on artifact-free, two expert agreement sleep data from 47 mostly healthy subjects using 33 features with SFS feature selection and four separately trained neural networks as classifiers.

Key ideas:

  • An alternative to using hand-tailored features derived from expert knowledge is to apply unsupervised feature learning techniques for learning the feature representations from unlabeled data.
  • The main focus is to learn meaningful feature representations from unlabeled sleep data.
  • EEG, EOG, and EMG records are segmented and used to train a deep belief network (DBN), using no prior knowledge.
  • A hidden Markov model (HMM) is integrated, and classification accuracy is compared with a feature-based approach that uses prior knowledge.
  • The inclusion of an HMM post-processing is to:
    • Improve the capture of more realistic sleep stage switching, for example, by suppressing excessive or unlikely sleep stage transitions.
    • Infuse human expert knowledge into the system.
  • Even though the classifier is trained using labeled data, the feature representations are learned from unlabeled data.
  • The paper also presents a study of anomaly detection with the application to home environment data collection.

Network architecture:

  • Deep belief networks (DBN).
  • A DBN is formed by stacking a user-defined number of RBMs on top of each other where the output from a lower-level RBM is the input to a higher-level RBM.
  • The main difference between a DBN and a multilayer perceptron is the inclusion of a bias vector for the visible units, which is used to reconstruct the input signal; this reconstruction plays an important role in the way DBNs are trained.
  • A reconstruction of the input can be obtained from the unsupervised pretrained DBN by encoding the input to the top RBM and then decoding the state of the top RBM back to the lowest level.
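A minimal numpy sketch of this stacking-and-reconstruction scheme, with each RBM trained by one-step contrastive divergence (CD-1); layer sizes and hyperparameters are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli RBM trained with CD-1; b_v is the visible
    bias used when reconstructing the input."""
    def __init__(self, n_vis, n_hid):
        self.W = rng.standard_normal((n_vis, n_hid)) * 0.01
        self.b_h = np.zeros(n_hid)
        self.b_v = np.zeros(n_vis)

    def up(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def down(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def train(self, data, lr=0.1, epochs=30):
        for _ in range(epochs):
            h0 = self.up(data)
            h_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = self.down(h_sample)                 # one Gibbs step
            h1 = self.up(v1)
            self.W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
            self.b_v += lr * (data - v1).mean(axis=0)
            self.b_h += lr * (h0 - h1).mean(axis=0)

# Stack two RBMs: the hidden activities of the first become the
# training data of the second (greedy layer-by-layer pretraining).
data = (rng.random((200, 12)) < 0.3).astype(float)
rbm1, rbm2 = RBM(12, 8), RBM(8, 4)
rbm1.train(data)
rbm2.train(rbm1.up(data))

# Reconstruct: encode up to the top RBM, then decode back down.
recon = rbm1.down(rbm2.down(rbm2.up(rbm1.up(data))))
print(recon.shape)  # (200, 12)
```

The up-then-down pass at the end is exactly the reconstruction path described above; its error is what the anomaly-detection experiments later exploit.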


  • Two datasets of electroencephalography (EEG) records of brain activity, electrooculography (EOG) records of eye movements, and electromyography (EMG) records of skeletal muscle activity.
    • The first consists of 25 acquisitions and is used to train and test the automatic sleep stager.
    • The second consists of 5 acquisitions and is used to validate anomaly detection on sleep data collected at home.
  • Benchmark Dataset. Provided by St. Vincent’s University Hospital and University College Dublin, which can be downloaded from PhysioNet [29].
  • Home Sleep Dataset. PSG data of approximately 60 hours (5 nights) was collected at a healthy patient’s home using an Embla Titanium PSG. A total of 8 electrodes were used: EEG C3, EEG C4, EOG left, EOG right, 2 electrodes for the EMG channel, a reference electrode, and a ground electrode.


  • Notch filtering at 50 Hz to cancel out power line disturbances; signals were downsampled to 64 Hz after being prefiltered with a band-pass filter of 0.3 to 32 Hz for EEG and EOG, and 10 to 32 Hz for EMG.
  • Each epoch before and after a sleep stage switch was removed from the training set to avoid possible subsections of mislabeled data within one epoch. This resulted in the removal of 20.7% of the total training samples.
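A self-contained numpy sketch of this preprocessing chain (in practice a proper filter-design routine such as scipy.signal would be used; the windowed-sinc band-pass and the assumed 256 Hz acquisition rate below are stand-ins):

```python
import numpy as np

def bandpass_fir(low_hz, high_hz, fs, numtaps=129):
    """Windowed-sinc band-pass FIR taps: low-pass at the high cut
    minus low-pass at the low cut, Hamming-windowed."""
    n = np.arange(numtaps) - (numtaps - 1) // 2
    def lowpass(fc):
        return 2 * fc / fs * np.sinc(2 * fc / fs * n) * np.hamming(numtaps)
    return lowpass(high_hz) - lowpass(low_hz)

fs = 256                                     # assumed acquisition rate
t = np.arange(0, 4, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 50 * t)  # alpha + mains hum

taps = bandpass_fir(0.3, 32.0, fs)           # the paper's EEG/EOG band
filtered = np.convolve(eeg, taps, mode='same')
downsampled = filtered[::fs // 64]           # crude decimation to 64 Hz
print(downsampled.shape)                     # (256,)
```

Because the 50 Hz mains component lies well outside the 0.3–32 Hz pass band, the band-pass stage already suppresses it strongly; a dedicated notch sharpens the rejection further.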

Experiment setup:

  • The five sleep stages that are at focus are:
    • Awake,
    • Stage 1 (S1),
    • Stage 2 (S2),
    • Slow wave sleep (SWS),
    • Rapid eye-movement sleep (REM).
  • These stages come from a unified method for classifying an 8 h sleep recording introduced by Rechtschaffen and Kales (R&K) [22].
  • The goal of this work is not to replicate the R&K system or improve current state-of-the-art sleep stage classification but rather to explore the advantages of deep learning and the feasibility of using unsupervised feature learning applied to sleep data.
  • Therefore, the main method of evaluation is a comparison with a feature-based shallow model.
  • Even though the goal in this work is not to replicate the R&K system, its terminology is used for evaluation of the proposed architecture.
  • A graph that shows these five stages over an entire night is called a hypnogram, and each epoch according to the R&K system is either 20 s or 30 s.
  • While the R&K system brings consensus on terminology, among other advantages [23], it has been criticized for a number of issues [24].
  • Each channel of the data in the proposed study is divided into segments of 1 second with zero overlap, which is a much higher temporal resolution than the one practiced by the R&K system.
  • The paper uses and compares three setups for an automatic sleep stager:
    1. feat-GOHMM: a shallow method that uses prior knowledge.
    2. feat-DBN: a deep architecture that also uses prior knowledge.
    3. raw-DBN: a deep architecture that does not use any prior knowledge.
  • feat-GOHMM:
    • A Gaussian observation hidden Markov model (GOHMM) is used on 28 handmade features;
    • Feature selection is done by sequential backward selection (SBS), which starts with the full set of features and greedily removes a feature after each iteration step.
    • A principal component analysis (PCA) with five principal components is used after feature selection, followed by a Gaussian mixture model (GMM) with five components.
    • Initial mean and covariance values for each GMM component are set to the mean and covariance of annotated data for each sleep stage.
    • The output from the GMM is used as input to a hidden Markov model (HMM) [25].
  • feat-DBN:
    • A 2-layer DBN with 200 hidden units in both layers and a softmax classifier attached on top is used on 28 handmade features.
    • Both layers are pretrained for 300 epochs, and the top layer is fine-tuned for 50 epochs. Initial biases of the hidden units are set empirically to −4 to encourage sparsity [26], which prevents the learning of trivial or uninteresting feature representations.
    • Scaling to values between 0 and 1 is done by subtracting the mean, dividing by the standard deviation, and finally adding 0.5.
  • raw-DBN:
    • A DBN with the same parameters as feat-DBN is used on preprocessed raw data.
    • Scaling is done by saturating the signal at a saturation constant sat_channel, then dividing by 2 × sat_channel, and finally adding 0.5. The saturation constants were set to sat_EEG = sat_EOG = ±60 μV and sat_EMG = ±40 μV.
    • Input consisted of the concatenation of EEG, EOG1, EOG2, and EMG. With window width w, the visible layer grows proportionally; with four signals, a 1-second window, and 64 samples per second, the input dimension is 4 × 64 = 256.
  • Anomaly detection for home sleep data:
    • Anomaly detection is evaluated by training a DBN and calculating the root mean square error (RMSE) from the reconstructed signal from the DBN and the original signal.
    • A faulty signal in one channel often affects other channels for sleep data, such as movement artifacts, blink artifacts, and loose reference or ground electrode. Therefore, a detected fault in one channel should label all channels at that time as faulty.
    • All signals, except EEG2, are nonfaulty prior to a movement artifact at t = 7 s. This movement affected the reference electrode or the ground electrode, resulting in disturbances in all signals for the rest of the night, thereby rendering the signals unusable by a clinician. A poorly attached electrode was the cause for the noise in signal EEG2.
    • Previous approaches to artifact rejection in EEG analysis range from simple thresholding on abnormal amplitude and/or frequency to more complex strategies for detecting individual artifacts [27], [28].
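The raw-DBN input construction described above can be sketched as follows (the synthetic microvolt traces are illustrative):

```python
import numpy as np

def scale_raw(x, sat):
    """Saturate at +/-sat microvolts, then map to [0, 1]:
    clip, divide by 2*sat, and add 0.5."""
    return np.clip(x, -sat, sat) / (2.0 * sat) + 0.5

fs, w = 64, 1                                 # 64 samples/s, 1-second window
rng = np.random.default_rng(0)
eeg = rng.normal(0, 30, fs * w)               # synthetic microvolt traces
eog1, eog2 = rng.normal(0, 30, (2, fs * w))
emg = rng.normal(0, 20, fs * w)

# Concatenate the four scaled channels into one visible-layer vector.
visible = np.concatenate([scale_raw(eeg, 60), scale_raw(eog1, 60),
                          scale_raw(eog2, 60), scale_raw(emg, 40)])
print(visible.shape)  # (256,)
```

Each 1-second multichannel window thus becomes one 256-dimensional training vector in [0, 1] for the DBN's visible layer.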


  • The results using raw data with a deep architecture, such as the DBN, were comparable to a feature-based approach when validated on clinical datasets.
  • F1-scores of the three setups: feat-GOHMM 63.9 ± 10.8, feat-DBN 72.2 ± 9.7, and raw-DBN 67.4 ± 12.9.

H/W, S/W and computation time:

  • On a Windows 7 64-bit machine with a quad-core Intel i5 3.1 GHz CPU and an NVIDIA GeForce GTX 470 GPU (via GPUmat), simulation times for feat-GOHMM, feat-DBN, and raw-DBN were approximately 10 minutes, 1 hour, and 3 hours per dataset, respectively.


  • Regarding DBN parameter selection, setting the initial biases of the hidden units to −4 was important for achieving good accuracy.
  • A better way of encouraging sparsity is to include a sparsity penalty term in the cost function [31] instead of crudely estimating the initial biases of the hidden units.
  • For the raw-DBN setup, it was also crucial to train each layer with a large number of epochs, particularly in the fine-tuning step.
  • Replacing HMM with conditional random fields (CRFs) could improve accuracy but is still a simplistic temporal model that does not exploit the power of DBNs [32].
  • While a clear advantage of using DBN is the natural way in which it deals with anomalous data, there are some limitations to the DBN:
    • The correlations between signals in the input data are not well captured. This gives a feature-based approach an advantage where, for example, the correlation between both EOG channels can easily be represented with a feature. This could be solved by either representing the correlation in the input or extending the DBN to handle such correlations, such as a cRBM [33].
    • It has been suggested for multimodal signals to train a separate DBN for each signal first and then train a top DBN with concatenated data [34]. This not only could improve classification accuracy, but also provide the ability to single out which signal contains the anomalous signal.
  • A lower performance was noticed if sleep stages were not balanced to equal sizes in the training set.
  • There was high variation in accuracy between patients, even when they came from the same dataset.
  • Increasing the number of layers or hidden units did not significantly improve classification accuracy; rather, it often resulted in a significant increase in simulation time.
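The sparsity penalty suggested above (in place of hand-set initial biases) is commonly implemented as a KL-divergence term added to the cost; a sketch under that assumption (target and weight values are illustrative):

```python
import numpy as np

def sparsity_penalty(hidden_probs, target=0.05, weight=0.1):
    """KL divergence between a target activation level and the mean
    activation of each hidden unit, summed over units."""
    rho = hidden_probs.mean(axis=0)
    kl = target * np.log(target / rho) + (1 - target) * np.log((1 - target) / (1 - rho))
    return weight * kl.sum()
```

The penalty is zero when every hidden unit is active at exactly the target rate and grows as the mean activations drift away from it.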

Future study:

  • The work explored clinical datasets in close cooperation with physicians. Future work will concentrate on at-home monitoring, where unsupervised feature learning is a highly promising method for sleep stage classification because data are abundant and labels are costly to obtain.

My Review:

  • This is a very interesting paper that demonstrates deep learning gives better classification accuracy (even though the standard deviation is slightly higher) compared to shallow feature learning.
  • The paper also explains many interesting insights on how best to train a deep belief network for sleep stage analysis.
  • It provides complete and valuable references not only on deep learning but also on sleep stage analysis and scoring, from both clinical and machine learning perspectives.
  • Overall, the report is very clear and comprehensive.

Review on A Deep Learning for Sentiment Analysis from Twitter

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews a deep learning system for sentiment analysis from Twitter: Coooolll: A Deep Learning System for Twitter Sentiment Classification

Addressed problem:

  • Twitter sentiment classification within a supervised learning framework.
  • Generate features in an unsupervised manner using deep learning from 10M tweets collected via positive and negative emoticons, without any manual annotation.
  • Combine the deep learning features with state-of-the-art hand-crafted features.
  • Classify sentiment using the features generated by deep learning and compare the result against using the joint features.

Previous works:

  • Twitter sentiment classification aims to classify the sentiment polarity of a tweet as positive, negative or neutral (Jiang et al., 2011; Hu et al., 2013; Dong et al., 2014).
  • The majority of existing approaches follow Pang et al. (2002) and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity.
  • Under this direction, most studies focus on designing effective features to obtain better classification performance (Pang and Lee, 2008; Liu, 2012; Feldman, 2013).

Key ideas:

  • Coooolll is built in a supervised learning framework by concatenating the sentiment-specific word embedding (SSWE) features with the state-of-the-art hand-crafted features (STATE).
  • To obtain large-scale training corpora, train the SSWE from 10M tweets collected by positive and negative emoticons, without any manual annotation.
  • The proposed system can be easily re-implemented with the publicly available sentiment-specific word embedding.
  • Conduct experiments on both positive/negative/neutral and positive/negative classification of tweets.
  • Develop a deep learning system for message-level Twitter sentiment classification.
  • Develop a neural network with hybrid loss function to learn SSWE, which encodes the sentiment information of tweets in the continuous representation of words.

Network architecture:

  • The proposed neural network for learning sentiment-specific word embedding is an extension of the traditional C&W model (Collobert et al., 2011).
  • Unlike C&W model that learns word embedding by only modeling syntactic contexts of words, the proposed SSWEu captures the sentiment information of sentences as well as the syntactic contexts of words.
  • Given an original (or corrupted) ngram and the sentiment polarity of a sentence as the input, SSWEu predicts a two-dimensional vector for each input ngram.


Data:

  • 10 million tweets from the Twitter Sentiment Analysis Track in SemEval 2014


Approach:

  1. Learn sentiment-specific word embedding (SSWE) (Tang et al., 2014), which encodes the sentiment information of text into the continuous representation of words (Mikolov et al., 2013; Sun et al., 2014).
  2. Concatenate the SSWE features with the STATE (state-of-the-art hand-crafted) features (Mohammad et al., 2013),
  3. Train the sentiment classifier with the benchmark dataset from SemEval 2013 (Nakov et al., 2013). The classifier is LibLinear (Fan et al., 2008)
  4. Test the trained model using test sets of SemEval 2014.
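The numbered steps above can be sketched end-to-end. The feature matrices below are random stand-ins for the real SSWE and STATE features, and a simple perceptron replaces the LibLinear classifier used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the real features: SSWE (embedding-derived) and STATE (hand-crafted)
sswe = rng.random((100, 50))
state = rng.random((100, 30))
X = np.hstack([sswe, state])            # step 2: concatenate the two feature sets
y = rng.integers(0, 2, 100) * 2 - 1     # sentiment labels in {-1, +1}

# step 3: train a linear classifier (the paper uses LibLinear; a perceptron stands in)
w = np.zeros(X.shape[1])
for _ in range(20):
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:
            w += yi * xi

# step 4: predict on held-out examples (here, the training examples for brevity)
pred = np.sign(X @ w)
```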


Features:

  • SSWE Features:
    • Given an original (or corrupted) ngram and the sentiment polarity of a sentence as the input, SSWEu predicts a two-dimensional vector for each input ngram.
    • The two scalars stand for language model score and sentiment score of the input ngram.
    • The training objectives of SSWEu are
      1. The original ngram should obtain a higher language model score than the corrupted ngram,
      2. The sentiment score of original should be more consistent with the gold polarity annotation of sentence than corrupted.
    • The loss function of SSWEu is the linear combination of two hinge losses: the syntactic loss and the sentiment loss.
    • After learning the SSWE, explore min, average, and max convolutional layers (Collobert et al., 2011; Socher et al., 2011; Mitchell and Lapata, 2010) to obtain the tweet representation.
    • The result is the concatenation of vectors derived from different convolutional layers.
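A sketch of the hybrid hinge loss described above (the function names and the interpolation weight alpha are illustrative; each model output f is assumed to be the pair of predicted scalars):

```python
def hinge(x):
    return max(0.0, 1.0 - x)

def sswe_loss(f_orig, f_corr, polarity, alpha=0.5):
    """f_orig/f_corr: (language-model score, sentiment score) predicted for the
    original and corrupted ngram; polarity: +1 (positive sentence) or -1 (negative)."""
    loss_syntactic = hinge(f_orig[0] - f_corr[0])               # original should rank higher
    loss_sentiment = hinge(polarity * (f_orig[1] - f_corr[1]))  # consistent with gold label
    return alpha * loss_syntactic + (1 - alpha) * loss_sentiment
```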
  • STATE Features:
    • Re-implement the state-of-the-art hand-crafted features (Mohammad et al., 2013) for Twitter sentiment classification.
    • The STATE features:
    • All Caps: The number of words with all characters in upper case.
    • Emoticons: The presence of positive (or negative) emoticons and whether the last unit of a segmentation is an emoticon.
    • Elongated Units: The number of elongated words (with one character repeated more than two times), such as gooood.
    • Sentiment lexicon: Several sentiment lexicons are used to generate features:
      • the number of sentiment words,
      • the score of last sentiment words,
      • the total sentiment score and
      • the maximal sentiment score for each lexicon.
    • Negation: The number of individual negations within a tweet.
    • Punctuation: The number of contiguous sequences of dots, question marks, and exclamation marks.
    • Cluster: The presence of words from each of the 1,000 clusters from the Twitter NLP tool.
    • Ngrams: The presence of word ngrams (1-4) and character ngrams (3-5).
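Several of the STATE features above are straightforward to compute; a partial sketch (token rules are simplified from the paper's exact definitions, and the negation list is illustrative):

```python
import re

NEGATIONS = {"not", "no", "never", "cannot", "n't"}

def state_features(tweet):
    tokens = tweet.split()
    return {
        "all_caps": sum(t.isalpha() and t.isupper() for t in tokens),
        # a character repeated more than two times, e.g. "gooood"
        "elongated": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
        # contiguous runs of '.', '!', '?'
        "punct_runs": len(re.findall(r"[.!?]{2,}", tweet)),
        "negations": sum(t.lower().strip(".,!?") in NEGATIONS for t in tokens),
    }
```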


Results:

  • Among the 45 submitted systems, including the SemEval 2013 participants, the proposed system (Coooolll) is ranked 2nd on the Twitter2014 test set of SemEval 2014 Task 9.
  • The performance of using only SSWE features is comparable to the state-of-the-art hand-crafted features, which verifies the effectiveness of the sentiment-specific word embedding.

My Review:

  • The paper is a nice read, as it refers to many state-of-the-art sentiment analysis systems.
  • The core of the system is learning sentiment-specific word embedding features using deep learning. The details of the algorithm are given in the paper.
  • Deep learning was used only to generate features. If feature extraction and classification could be performed in a single pass, it might bring deeper insight into how the deep classifier also fine-tunes the extracted features.

Review on A Deep Learning for Sentiment Analysis

Posted by Mohamad Ivan Fanany

This writing summarizes and reviews a deep learning approach for large-scale sentiment classification (or sentiment analysis): Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach

Printed version


  • With the rise of social media such as blogs and social networks, reviews, ratings, and recommendations are rapidly proliferating.
  • The ability to automatically filter them is a key challenge for businesses seeking to sell their wares and identify new market opportunities.
  • Why evaluating Deep Learning for sentiment analysis is interesting?:
    • There exist generic concepts that characterize product reviews across domains.
    • Deep Learning can disentangle the underlying factors of variation.
    • Domain adaptation for sentiment analysis becomes a medium for better understanding deep architectures.
    • Even though Deep Learning has not yet been evaluated for domain adaptation of sentiment classifiers, several very interesting results have been reported on other tasks involving textual data, beating the previous state-of-the-art in several cases (Salakhutdinov and Hinton, 2007; Collobert and Weston, 2008; Ranzato and Szummer, 2008).


  • Reviews can span so many different domains that it is difficult to gather annotated training data for all of them.

Addressed problem:

  • Sentiment classification (or sentiment analysis): determine the judgment of a writer with respect to a given topic based on a given textual comment.
  • Tackle the problem of domain adaptation for sentiment classifiers: a system is trained on labeled reviews from one source domain but is meant to be deployed on another.

Previous works:

  • Sentiment analysis is now a mature machine learning research topic (Pang and Lee, 2008).
  • Applications:
  • Large variety of data sources makes it difficult and costly to design a robust sentiment classifier.
  • Reviews deal with various kinds of products or services for which vocabularies are different.
  • Data distributions are different across domains -> Solutions:
    1. Learn a different system for each domain:
      • High cost to annotate training data for a large number of domains,
      • Cannot exploit information shared across domains.
    2. Learn a single system from the set of domains.
  • The problem of training and testing models on different distributions is known as domain adaptation (Daumé III and Marcu, 2006).
  • Learning setups relating to domain adaptation have been proposed before and published under different names.
  • Daumé III and Marcu (2006) formalized the problem and proposed an approach based on a mixture model.
  • Ways to address domain adaptation:
    • Instance weighting (Jiang and Zhai, 2007): in which instance-dependent weights are added to the loss function
    • Data representation: find a representation under which the source and target domains present the same joint distribution of observations and labels.
      • Formal analysis of the representation change (Ben-David et al. (2007))
      • Structure Correspondence Learning (SCL): makes use of the unlabeled data from the target domain to find a low-rank joint representation of the data
    • Ignoring the domain difference (Dai et al., 2007): consider source instances as labeled data and target ones as unlabeled data.
  • Dai et al., (2007) approach is very close to self-taught learning by Raina et al., (2007) in which one learns from labeled examples of some categories as well as unlabeled examples from a larger set of categories.
  • Like the proposed method in this paper, Raina et al. (2007) relies crucially on the unsupervised learning of a representation.

Inspiration from previous works:

  • RBMs with (soft) rectifier units have been introduced in (Nair and Hinton, 2010). The authors have used such units because they have been shown to outperform other non-linearities on a sentiment analysis task (Glorot et al., 2011).
  • Support Vector Machines (SVM) being known to perform well on sentiment classification (Pang et al., 2002). The authors use a linear SVM with squared hinge loss. This classifier is eventually tested on the target domain(s).

Key ideas:

  • Deep learning for extracting a meaningful representation in an unsupervised fashion.
  • Deep learning for domain adaptation of sentiment classifiers.
  • Existing domain adaptation methods for sentiment analysis focus on the information from the source and target distributions, whereas the proposed unsupervised learning (SDA) can use data from other domains, sharing the representation across all those domains.
  • Such representation sharing reduces the computation required to transfer to several domains, because only a single round of unsupervised training is required; it also allows the method to scale well to large amounts of data and to real-world applications.
  • Existing domain adaptation methods for sentiment analysis map inputs into a new or an augmented space using only linear projections. The code learned by the proposed SDA is a non-linear mapping of the input and can therefore encode complex data variations.
  • Rectifier non-linearities have the nice ability to naturally provide sparse representations (with exact zeros) for the code layer; they are well suited to linear classifiers and are efficient with respect to computational cost and memory use.

Domain adaptation:

  • The training and testing data are sampled from different distributions.
  • Deep Learning algorithms learn intermediate concepts between raw input and target.
  • These intermediate concepts could yield better transfer across domains.
  • Exploit the large amounts of unlabeled data across all domains to learn these intermediate representations.


Data:

  • Amazon data: more than 340,000 reviews regarding 22 different product types, for which reviews are labeled as either positive or negative.
  • Challenges: heterogeneous, heavily unbalanced and large-scale.
  • A smaller and more controlled version has been released:
    • Only 4 different domains: Books, DVDs, Electronics and Kitchen appliances.
    • 1000 positive and 1000 negative instances for each domain
    • A few thousand unlabeled examples.
    • The positive and negative examples are also exactly balanced
  • The reduced version is used as a benchmark in the literature.
  • The paper will contain the first published results on the large Amazon dataset.

Compared methods:

  • Structural Correspondence Learning (SCL) for sentiment analysis (Blitzer et al. 2007)
  • Multi-label Consensus Training (MCT) approach which combines several base classifiers trained with SCL (Li and Zong 2008).
  • Spectral Feature Alignment (Pan et al., 2010)

Learning algorithm:

  • Stacked Denoising Auto-encoder (Vincent et al., 2008).
  • Access to unlabeled data from various domains, but access to the labels for one source domain only.
  • Two-step procedure:
    • Unsupervisedly learn higher-level features from the text reviews of all the available domains using a Stacked Denoising Autoencoder (SDA) with rectifier units (i.e., max(0, x)).
    • Train a linear classifier on the transformed labeled data of the source domain.
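The first step can be sketched for a single denoising layer (dimensions, tied weights, and initialization below are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.8):
    # masking noise: zero out each input with probability p (0.8 worked best in the paper)
    return x * (rng.random(x.shape) > p)

d, h = 200, 100                      # visible and hidden sizes (illustrative)
W = rng.normal(0.0, 0.01, (d, h))
b_h, b_v = np.zeros(h), np.zeros(d)

def encode(x):
    # rectifier units max(0, x) give a sparse code
    return np.maximum(0.0, corrupt(x) @ W + b_h)

def reconstruct(x):
    # linear decoder with tied weights; trained to recover the UNcorrupted input
    return encode(x) @ W.T + b_v
```

Training would minimize the reconstruction error against the clean input; the learned `encode` then feeds the linear SVM of step two.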


Experiments:

  • Preprocessing follows (Blitzer et al., 2007):
    • Each review text is treated as a bag-of-words and transformed into binary vectors encoding the presence/absence of unigrams and bigrams.
    • Keep 5000 most frequent terms of the vocabulary of unigrams and bigrams in the feature set.
    • Split train/test data.
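The binary unigram/bigram representation can be sketched without any external library (a simplified stand-in for the paper's preprocessing; tokenization is naive whitespace splitting):

```python
from collections import Counter

def terms(text):
    tokens = text.split()
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return tokens + bigrams

def build_vocab(docs, max_features=5000):
    # keep the most frequent unigrams and bigrams across the corpus
    counts = Counter()
    for d in docs:
        counts.update(terms(d))
    return [t for t, _ in counts.most_common(max_features)]

def binarize(doc, vocab):
    # binary presence/absence vector over the fixed vocabulary
    present = set(terms(doc))
    return [1 if t in present else 0 for t in vocab]
```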
  • Baseline: a linear SVM trained on the raw data
  • The proposed method is also a linear SVM, but trained and tested on data whose features have been transformed by the unsupervised feature extractor.
  • The hyper-parameters of all SVMs are chosen by cross-validation on the training set.
  • Explored an extensive set of hyper-parameters:
    • A masking noise probability (its optimal value was usually high: 0.8);
    • A Gaussian noise standard deviation for upper layers;
    • A size of hidden layers (5000 always gave the best performance);
    • An L1 regularization penalty on the activation values.
    • A learning rate.
  • All algorithms were implemented using the Theano library (Bergstra et al., 2010).
  • Basic Metric:
    • Transfer error: the test error obtained by a method trained on the source domain and tested on the target domain.
    • In-domain error: the test error when the source domain and the tested domain are the same.
    • Test error: the test error obtained by the baseline method, i.e., a linear SVM on raw features, trained and tested on the raw features of the target domain.
    • Transfer loss: the difference between the transfer error and the in domain baseline error.
  • For a large number of heterogeneous domains with different difficulties (as with the large Amazon data), the transfer loss is not satisfactory.
  • Advanced metric:
    • Transfer ratio: also characterizes the transfer, but is defined by replacing the difference with a quotient. It is less sensitive to large variations of in-domain errors, and thus better suited to averaging.
    • In-domain ratio.
  • Compare the results from the original paper of 3 compared methods (SCL, MCT, SFA), which have been obtained using the whole feature vocabulary and on different splits, but of identical sizes:
    • Results are consistent whatever the train/test splits as long as set sizes are preserved.
    • All baselines achieve similar performances.
  • Compare the results from Transductive SVM (Sindhwani and Keerthi, 2006) trained in a standard semi-supervised setup: the training set of the source domain is used as labeled set, and the training set of the other domains as the unlabeled set:
    • The unsupervised feature extractor is made of a single layer of 5000 units.
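The evaluation metrics above reduce to simple formulas (a sketch; the variable names are my own):

```python
def transfer_loss(e_transfer, e_baseline):
    # t(S, T) = e(S, T) - e_b(T, T): transfer error minus in-domain baseline error
    return e_transfer - e_baseline

def transfer_ratio(pairs):
    # mean of e(S, T) / e_b(T, T) over (source, target) pairs with S != T
    return sum(e / b for e, b in pairs) / len(pairs)
```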


Results:

  • Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products.
  • This method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.


Conclusions:

  • The paper demonstrated that a Deep Learning system based on Stacked Denoising Auto-Encoders with sparse rectifier units can perform an unsupervised feature extraction which is highly beneficial for the domain adaptation of sentiment classifiers.
  • Experiments have shown that linear classifiers trained with this higher-level learnt feature representation of reviews outperform the current state-of-the-art.
  • Furthermore, the paper successfully performs domain adaptation on an industrial-scale dataset of 22 domains, significantly improving generalization over the baseline and over a similarly structured but purely supervised alternative.

My Review:

  • This paper demonstrates a nice proof that learnt high-level features produced by deep learning lead to lower classification error compared to the state-of-the-art shallow classifiers.
  • One important motivation for using the high-level features is the domain adaptation problem, which can be specifically addressed by Deep Learning.
  • It would be very nice if the authors put the data online so that it could also be tested using different Deep Learning algorithms and techniques.

Review on Deep Learning for Big Data: Challenges and Perspective

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews a paper on deep learning for big data: Big Data Deep Learning: Challenges and Perspectives


  • Deep learning and Big Data are two of the hottest trends in the rapidly growing digital world.
  • Big Data: exponential growth and wide availability of digital data
  • The Internet is processing 1,826 Petabytes of data per day [1]
  • Digital information has grown nine times in volume in just five years [2].
  • By 2020, digital information in the world will reach 35 trillion gigabytes [3].
  • Machine learning techniques together with advances in available computational power, play a vital role in Big Data analytics and knowledge discovery [5, 6, 7, 8]
  • Deep learning together with Big Data: “big deals and the bases for an American innovation and economic revolution” [9].
  • Deep learning is coming to play a key role in providing big data predictive analytics solutions.
  • Big Data deep learning, which uses great computing power to speed up the training process, has shown significant potential for Big Data analytics.

Addressed problems:

  • Provide a brief overview of deep learning, and highlight current research efforts, the challenges posed by big data, and future trends.
  • The paper is not a comprehensive survey of all related work in deep learning.

Deep learning:

  • Definition: machine learning techniques that use supervised and/or unsupervised strategies to automatically learn multiple levels of hierarchical representations in deep architectures for classification [10], [11].
  • Inspired by biological observations on human brain mechanisms for processing of natural signals.
  • State-of-the-arts performance:
    • speech recognition [12], [13],
    • collaborative filtering [14],
    • computer vision [15], [16].
    • Apple’s Siri [17]
    • Google’s deep learning [18]
    • Microsoft Bing’s voice search [19]
    • IBM brain-like computer [18, 20]
  • Two well-established deep architectures:
    1. Deep belief networks (DBNs) [21, 22, 23]
    2. Convolutional neural networks (CNNs) [24, 25, 26].

    A. Deep Belief Networks:

    • Conventional neural networks:
      1. Prone to get trapped in local optima of a non-convex objective function [27].
      2. Cannot take advantage of unlabeled data, which are often abundant and cheap to collect in Big Data.
    • Deep belief network (DBN) uses a deep architecture that is capable of learning feature representations from both the labeled and unlabeled data presented to it [21].
    • DBN incorporates both unsupervised pre-training and supervised fine-tuning strategies to construct the models.
    • DBN architecture is a stack of Restricted Boltzmann Machines (RBMs) and one or more additional layers for discrimination tasks.
    • RBMs are probabilistic generative models that learn a joint probability distribution of observed (training) data without using data labels [28]
    • DBN can effectively utilize large amounts of unlabeled data for exploiting complex data structures.
    • Once the structure of a DBN is determined, the goal for training is to learn the weights (and biases) between layers.
    • Training is conducted firstly by an unsupervised learning of RBMs.
    • RBM consists of two layers: nodes in one layer are fully connected to nodes in the other layer and there is no connection for nodes in the same layer
    • Train the generative weights of each RBMs using Gibbs sampling [29, 30].
    • Before fine-tuning, a layer-by-layer pre-training of RBMs is performed: the outputs of an RBM are fed as inputs to the next RBM and the process repeats until all the RBMs are pretrained.
    • Layer-by-layer unsupervised learning helps avoid local optima and alleviates the over-fitting problem that is observed when millions of parameters are used.
    • The algorithm's time complexity is linear in the number and size of RBMs [21].
    • Features at different layers contain different information about data structures with higher-level features constructed from lower-level features.
    • Number of stacked RBMs is a parameter predetermined by users and pre-training requires only unlabeled data (for good generalization).
    • Weights are updated based on an approximate method called contrastive divergence (CD) approximation [31].
    • While the expectations may be calculated by running Gibbs sampling infinitely many times, in practice, one-step CD is often used because it performs well [31]. Other model parameters (e.g., the biases) can be updated similarly.
    • After pre-training, DBN adds a final layer representing the desired outputs.
    • The overall network is fine tuned using labeled data and back propagation strategies for better discrimination.
    • The final layer is called associative memory.
    • Instead of using RBMs, other variations of pre-training:
      1. Stacked denoising auto-encoders [32], [33],
      2. Stacked predictive sparse coding [34].
    • Recent results show that, when a large amount of training data is available, fully supervised training using random initial weights instead of the pre-trained weights (i.e., without using RBMs or auto-encoders) works well in practice [13], [35].
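The RBM training described above (Gibbs sampling with one-step contrastive divergence) can be written compactly; a sketch for a binary RBM (learning rate and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 step for a binary RBM: sample hidden units, reconstruct visibles,
    then move the weights toward the data statistics and away from the model's."""
    h0 = sigmoid(v0 @ W + b_h)                       # P(h = 1 | v0)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h0_sample @ W.T + b_v)              # one-step reconstruction
    h1 = sigmoid(v1 @ W + b_h)
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_v, b_h
```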

    B. Convolutional neural networks:

    • A CNN is composed of many layers of hierarchy, with some layers [24] for:
      • Feature representation (or feature map) layers
      • Classification layers
    • The feature representation layers start with two alternating types of layers called convolutional and subsampling layers.
    • The convolutional layers perform convolution operations with several filter maps of equal size.
    • The subsampling layers reduce the sizes of the preceding layers by averaging pixels within a small neighborhood (or by max-pooling [36], [37]).
    • The value of each unit in a feature map depends on a local receptive field in the previous layer and the filter.
    • Each unit's value is passed through a nonlinear activation function; most recently, a scaled hyperbolic tangent function [38].
    • The key parameters to be learned: weights between layers.
    • Standard training: backpropagation using a gradient descent algorithm
    • Loss function: mean squared error
    • Training deep CNN architectures can be unsupervised.
    • Unsupervised training of CNNs: Predictive sparse decomposition (PSD) [39] that approximate inputs with a linear combination of some basic and sparse functions.
    • Inspired by biological processes [40], CNN learns a hierarchical feature representation by utilizing strategies like:
      1. Local receptive fields: the size of each filter is normally small,
      2. Shared weights: using the same weights to construct all the feature maps at the same level to significantly reduces the number of parameters,
      3. Subsampling to further reduce the dimensionality.
    • A CNN is capable of learning good feature hierarchies automatically and providing some degree of translational and distortional invariances.
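The three strategies above (local receptive fields, shared weights, subsampling) can be sketched directly in NumPy (a minimal single-channel sketch, not an efficient implementation):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # one shared kernel (local receptive field) slides over the whole image,
    # so all units in the feature map reuse the same few weights
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, s=2):
    # average pooling over non-overlapping s x s neighborhoods
    h, w = fmap.shape[0] // s * s, fmap.shape[1] // s * s
    return fmap[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
```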

Deep learning for massive amount of data:

  • A surge in interest in effective and scalable parallel algorithms for training deep models for big data [12, 13, 15, 41, 42, 43, 44].
  • Large-scale deep learning: large volumes of data and large models.
  • Large-scale learning algorithms:
    1. Locally connected networks [24], [39],
    2. Improved optimizers [42],
    3. New structures (Deep Stacking Network or DSN) that can be implemented in parallel [44].
  • A DSN consists of several specialized neural networks (called modules) with a single hidden layer.
  • A new deep architecture called the Tensor Deep Stacking Network (T-DSN), based on the DSN, is implemented using CPU clusters for scalable parallel computing [45].
  • One way to scale up DBNs is to use multiple CPU cores, with each core dealing with a subset of training data (data-parallel schemes).
  • Some technical details of parallelization [46]:
    • Carefully designing data layout,
    • Batching of the computation,
    • Using SSE2 instructions,
    • Leveraging SSE3 and SSE4 instructions for fixed-point implementation. (These implementations can enhance the performance of modern CPUs for deep learning.)
  • Parallelize Gibbs sampling of hidden and visible units by splitting hidden units and visible units into n machines, each responsible for 1/n of the units [47].
  • Data transfer between machines is required (i.e., when sampling the hidden units, each machine must have the data for all the visible units, and vice versa).
  • If both the hidden and visible units are binary and the sample size is modest, the communication cost is small.
  • If large-scale datasets are used, the communication cost can rise quickly.
  • FPGA-based implementation of large-scale deep learning [48]:
    1. A control unit implemented in a CPU,
    2. A grid of multiple full-custom processing tiles
    3. A fast memory.
  • As of August 2013, NVIDIA single precision GPUs exceeded 4.5 TeraFLOP/s with a memory bandwidth of near 300 GB/s [49].
  • A typical CUDA-capable GPU consists of several multi-processors.
  • Each multi-processor (MP) consists of several streaming multiprocessors (SMs) to form a building block
  • Each SM has multiple stream processors (SPs) that share control logic and low-latency memory.
  • Each GPU has a global memory with very high bandwidth and high latency when accessed by the CPU (host).
  • GPU two levels of parallelism:
    1. Instruction (memory) level (i.e., MPs) and
    2. Thread level (SPs).
  • This SIMT (Single Instruction, Multiple Threads) architecture allows for thousands or tens of thousands of threads to be run concurrently, which is best suited for operations with large number of arithmetic operations and small access times to memory.
  • The parallelism can also be effectively utilized with special attention on the data flow when developing GPU parallel computing applications.
  • Reduce the data transfer between RAM and the GPU’s global memory [50] by transferring data with large chunks.
  • Upload as large sets of unlabeled data as possible and by storing free parameters as well as intermediate computations, all in global memory.
  • Data parallelism and learning updates can be implemented by leveraging the two levels of parallelism:
    1. Input examples can be assigned across MPs,
    2. Individual nodes can be treated in each thread (i.e., SPs).

Large-scale DBN:

  • Raina et al. [41] proposed a GPU-based framework for massively parallelizing unsupervised learning models including DBNs (stacked RBMs) and sparse coding [21].
  • Number of free parameters to be trained:
    • Hinton & Salakhutdinov [21]: 3.8 million parameters for face images
    • Ranzato and Szummer [51]: three million parameters for text processing
    • Raina et al. [41]: More than 100 million free parameters with millions of unlabeled training data.
  • Transferring data between host and GPU global memory is time consuming.
  • Minimize host-device transfers and take advantage of shared memory.
  • Store all parameters and a large chunk of training examples in global memory during training, so that parameter updates can be carried out fully inside GPUs [41].
  • Utilize MP/SP levels of parallelism.
  • Each time, select a few of the unlabeled training data in global memory to compute the updates concurrently across blocks (data parallelism).
  • Meanwhile, each component of the input example is handled by SPs.
  • When implementing the DBN learning, Gibbs sampling [52], [53] is repeated.
  • Gibbs sampling can be implemented in parallel for the GPU, where each block takes an example and each thread works on an element of the example.
  • Weight update operations can be performed in parallel using linear algebra packages for the GPU after new examples are generated.
  • With 45 million parameters in an RBM and one million examples, the GPU-based implementation speeds up DBN learning by a factor of up to 70 compared to a dual-core CPU implementation (around 29 minutes for the GPU-based implementation versus more than one day for the CPU-based one) [41].

Large-scale CNN:

  • CNNs are a type of locally connected deep learning model.
  • Large-scale CNN learning is often implemented on GPUs with several hundred parallel processing cores.
  • CNN training involves both forward and backward propagation.
  • For parallelizing forward propagation, one or more blocks are assigned for each feature map depending on the size of maps [36].
  • Each thread in a block is devoted to a single neuron in a map.
  • Computation of each neuron includes:
    1. Convolution of shared weights (kernels) with neurons from the previous layers,
    2. Activation
    3. Summation in an SP.
    4. Store the outputs in the global memory.
  • Weights are updated by back-propagation of errors.
  • Parallelizing backward propagation can be implemented either by pulling or pushing [36].
  • Pulling error signals: the process of computing delta signals for each neuron in the previous layer by pulling the error signals from the current layer.
  • Beware of the border-effects problem in pulling, caused by the subsampling and convolution operations: neurons in the previous layer may connect to different numbers of neurons in the current layer [54].
  • For implementing data parallelism, one needs to consider the size of global memory and feature map size.
  • At any given stage, a limited number of training examples can be processed in parallel.
  • Within each block, where the convolution operation is performed, only a portion of a feature map can be maintained at any given time due to the extremely limited amount of shared memory.
  • For convolution operations, use limited shared memory as a circular buffer [37], which only holds a small portion of each feature map loaded from global memory each time.
  • Convolution will be performed by threads in parallel and results are written back to global memory.
  • To further overcome the GPU memory limitation, a modified architecture combines the convolution and subsampling operations into one step [37].
  • For a further speedup, two GPUs were used to train CNNs with five convolutional layers and three fully connected classification layers [55].
  • A CNN that uses Rectified Linear Units (ReLUs) as the nonlinear function (f(x) = max(0, x)) has been shown to train several times faster than with other commonly used functions [55].
  • For some layers, about half of the network is computed on one GPU and the other half on the second GPU; the two GPUs communicate at certain layers without using host memory.
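The per-neuron forward computation listed above (convolution of a shared kernel, then a nonlinear activation) can be sketched as follows. This is a hedged numpy illustration in which each output position corresponds to one neuron (one GPU thread in the scheme of [36]); ReLU stands in for the activation, and sizes are toy values.

```python
import numpy as np

def feature_map_forward(image, kernel, bias=0.0):
    """Valid convolution of a shared kernel over the image, then ReLU activation."""
    kh, kw = kernel.shape
    H, W = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):                 # each (i, j) is one neuron / one GPU thread
        for j in range(W):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return np.maximum(out, 0.0)        # ReLU activation, f(x) = max(0, x)

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
k = np.ones((3, 3)) / 9.0                        # 3x3 averaging kernel
fm = feature_map_forward(img, k)
print(fm.shape)  # (3, 3)
```

In the circular-buffer scheme of [37], only a few rows of `img` would reside in shared memory at a time, with results written back to global memory.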

Combination of Data and Model Parallelism:

  • DistBelief [56]: distributed training and learning in deep networks with very large models (e.g., a few billion parameters) and large-scale data sets.
  • DistBelief: large-scale clusters of machines to manage both data and model parallelism via multithreading, message passing, synchronization as well as communication between machines.
  • The model is partitioned into 169 machines, each with 16 CPU cores.
  • To deal with large-scale data with high dimensionality, deep learning often involves many densely connected layers with a large number of free parameters (i.e., large models).
  • Model parallelism:
    • Allowing users to partition large network architectures into several smaller structures (called blocks), whose nodes will be assigned to and calculated in several machines.
    • Each block will be assigned to one machine.
    • Boundary nodes (nodes whose edges belong to more than one partition) require data transfer between machines.
    • Fully-connected networks have more boundary nodes and often demand higher communication costs than locally-connected structures.
    • As many as 144 partitions, reported for large models in DistBelief, lead to significant improvement in training speed.
  • Data parallelism:
    • Employs two separate distributed optimization procedures:
      1. Downpour stochastic gradient descent (SGD)
      2. Sandblaster L-BFGS [56].
    • In practice, the Adagrad adaptive learning rate procedure [57] is integrated into the Downpour SGD for better performance.
    • DistBelief is implemented in two deep learning models:
      1. Fully connected network with 42 million model parameters and 1.1 billion examples,
      2. Locally-connected convolutional neural network with 16 million images of 100 by 100 pixels and 21,000 categories (as many as 1.7 billion parameters).
  • Experimental results: locally connected learning models benefit more from DistBelief. With 81 machines and 1.7 billion parameters, the method is 12× faster than using a single machine.
  • Scale up from single machine to thousands of machines is the key to Big Data analysis.
  • Train a deep architecture with a sparse deep autoencoder, local receptive fields, pooling, and local contrast normalization [50].
  • Scale up the dataset, the model, and the resources all together.
  • Multiple cores allow for another level of parallelism where each subset of cores can perform different tasks.
  • Asynchronous SGD is implemented with several replicas of the core model, each processing its own mini-batches of training examples.
  • The framework was able to train on as many as 14 million images of 200 by 200 pixels across more than 20 thousand categories in three days, on a cluster of 1,000 machines with 16,000 cores.
  • The model is capable of learning high-level features to detect objects without using labeled data.

The COTS HPC Systems:

  • Although DistBelief can learn very large models (more than one billion parameters), its training requires 16,000 CPU cores, which are not commonly available to most researchers.
  • Most recently, Coates et al. presented an alternative approach that trains comparable deep network models with more than 11 billion free parameters by using just three machines [58].
  • The Commodity Off-The-Shelf High Performance Computing (COTS HPC) system comprises a cluster of 16 GPU servers with InfiniBand adapters for interconnects and MPI for data exchange within the cluster.
  • Each server is equipped with four NVIDIA GTX680 GPUs, each having 4GB of memory. With well-balanced number of GPUs and CPUs, COTS HPC is capable of running very large-scale deep learning.
  • Coates et al. [58] fully take advantage of matrix sparseness and local receptive field by extracting nonzero columns for neurons that share identical receptive fields, which are then multiplied by the corresponding rows.
  • This strategy successfully avoids the situation where the requested memory is larger than the shared memory of the GPU.
  • Matrix operations are performed using the highly optimized MAGMA BLAS matrix-matrix multiply kernels [59].
  • GPUs are further being utilized to implement a model parallel scheme:
    • Each GPU is only used for a different part of the model optimization with the same input examples;
    • Their communication occurs through MVAPICH2 MPI. This very large-scale deep learning system is capable of training models with more than 11 billion parameters, the largest reported so far, with far fewer machines.
  • It has been observed by several groups (see [41]) that a single CPU is impractical for deep learning with a large model. With multiple machines, the running time may no longer be a major concern (see [56]).
  • Major research efforts have moved toward GPU-based experiments; representative systems and their running times:
    1. DBN [41]: NVIDIA GTX 280 GPU, 1 GB Mem; 1 million images, 100 million parameters, ~1 day
    2. CNN [55]: 2 NVIDIA GTX 580, each 3GB Mem; 1.2 million high resolution (256×256) images, 60 million parameters; ~5-6 days
    3. DistBelief [56]: 1,000 CPUs, Downpour SGD, Adagrad; 1 billion audio samples, 42 million model parameters; ~16 hours
    4. Sparse autoencoder [50]: 1000 CPUs, 16,000 cores; 100 million 200×200 images, 1 billion parameters; ~3 days
    5. COTS HPC [58]: 64 NVIDIA GTX 680 GPUs, each with 4GB Mem; 100 million 200×200 images, 11 billion parameters, ~3 days
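The sparse-multiply trick of Coates et al. [58] — exploiting local receptive fields by multiplying only the nonzero columns of each weight row — can be sketched in numpy as follows. Sizes and names are illustrative, and the dense matrix is built only to check the result.

```python
import numpy as np

def local_forward(W_local, starts, x):
    """W_local[i] holds the nonzero slice of weight row i, starting at starts[i]."""
    width = W_local.shape[1]
    return np.array([W_local[i] @ x[s:s + width]
                     for i, s in enumerate(starts)])

rng = np.random.default_rng(1)
x = rng.normal(size=10)                          # input vector
starts = np.array([0, 2, 4, 6])                  # receptive-field offsets
W_local = rng.normal(size=(4, 4))                # 4 neurons, width-4 local fields

# Dense equivalent for checking: embed the slices into a mostly-zero matrix.
W_dense = np.zeros((4, 10))
for i, s in enumerate(starts):
    W_dense[i, s:s + 4] = W_local[i]

print(np.allclose(local_forward(W_local, starts, x), W_dense @ x))  # True
```

The point is that only 16 multiplies are needed instead of 40, and the compact slices fit in the GPU's limited shared memory where the full dense matrix would not.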

Remaining Challenges and Perspectives:

  • Traditional machine learning: data completely loaded into memory.
  • Many significant challenges posed by Big Data [63]:
    • volume: large scale of data
    • variety: different types of data
    • velocity: speed of streaming data
  • Future deep learning systems should be scalable to Big Data.
  • Develop high-performance computing infrastructure-based systems, together with theoretically sound parallel learning algorithms or novel architectures.
  • Big Data is often incomplete as a result of its disparate origins.
  • Majority of data may not be labeled, or if labeled, there exist noisy labels.
  • Deep learning is effective in integrating data from different sources. For example, Ngiam et al. [69] developed novel deep learning algorithms to learn representations by integrating audio and video data.
  • Deep learning is generally effective in
    1. learning single modality representations through multiple modalities with unlabeled data
    2. learning shared representations capable of capturing correlations across multiple modalities.
  • A multimodal Deep Boltzmann Machine (DBM) fuses real-valued dense image data with text data having sparse word frequencies [70].
  • Different sources may offer conflicting information.
  • Current deep learning: mainly tested upon bi-modalities (i.e., data from two sources).
    • Will system performance benefit from significantly enlarged modalities?
    • What levels in deep learning architectures are appropriate for feature fusion of heterogeneous data?
  • Data are generated at extremely high speed and need to be processed in a timely manner. One solution: online learning approaches.
  • Online learning processes one instance at a time; the true label of each instance soon becomes available and can be used to refine the model [71] [72] [73] [74] [75] [76].
  • Conventional neural networks have been explored for online learning, but with only limited progress.
  • Instead of proceeding sequentially one example at a time, the updates can be performed on a minibatch basis [37].
  • Data are often non-stationary, i.e., data distribution is changing over time.
  • Deep online learning is attractive: online learning often scales naturally, is memory bounded, readily parallelizable, and theoretically guaranteed [98].
  • Deep learning for the high variety and velocity of Big Data: transfer learning or domain adaptation, where training and test data may be sampled from different distributions [99] [100] [101] [102] [103] [104] [105] [106] [107].
  • Recent domain adaptation with deep learning:
    • Train unsupervisedly on a large amount of unlabeled data from a set of domains, then use the result to train a classifier with few labeled examples from only one domain [100].
    • Apply deep learning of multiple levels of representation for transfer learning, where the training examples may not represent the test data well [99]. More abstract features discovered by deep learning are likely to be generic across training and test data.
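The minibatch-based online updates mentioned above can be sketched for a simple logistic model. This is a hedged numpy illustration: the streaming setup is simulated, labels "arrive" with the data as in the online-learning scenario, and the model, sizes, and learning rate are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])    # hidden ground truth for the simulation
w = np.zeros(3)                        # the online model's weights

def stream(n_batches, batch=16):
    """Simulate a labeled stream: labels arrive shortly after the data."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch, 3))
        y = (X @ true_w > 0).astype(float)
        yield X, y

for X, y in stream(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # logistic prediction on arrival
    w += 0.1 * X.T @ (y - p) / len(X)      # one gradient step per minibatch

print(np.sign(w))  # direction should roughly match the signs of true_w
```

Each minibatch is seen once and discarded, so memory stays bounded regardless of how long the stream runs — the property that makes online learning attractive for high-velocity data.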

My Review:

  • This paper is a very interesting state-of-the-art review on large-scale deep learning for Big Data applications, its challenges, and future trends.
  • The state-of-the-art solutions seem to be DistBelief and COTS HPC.
  • There is no mention of cloud-based solutions, which would let average researchers harness high-performance computation without building their own clusters that would soon become outdated.
  • Before reading this review, I hoped the authors would give some advice to average researchers who cannot afford high-end, industrial-level computing power on which directions are still open to explore without such machines (working on theoretical aspects would of course be one of the easy answers (smile)).

Review on The First Deep Learning for Churn Prediction

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews the first reported work on deep learning for churn (the loss of customers who move to competitors) prediction: Using Deep Learning to Predict Customer Churn in a Mobile Telecommunication Network.


  • In telecommunication companies, churn costs roughly $10 billion per year.
  • Acquiring new customers costs five to six times more than retaining existing ones.
  • The current focus is to move from customer acquisition towards customer retention.
  • Being able to predict customer churn in advance provides a company highly valuable insight for retaining and increasing its customer base. Tailored promotions can be offered to specific customers who are not satisfied.
  • Deep learning attempts to learn multiple levels of representation and automatically comes up with good features and representation for the input data.
  • The goal is to investigate the application of deep learning as a predictive model, to avoid the time-consuming feature engineering effort and, ideally, to increase the predictive performance of previous models.

Addressed problem: Using deep learning for predicting churn in a prepaid mobile telecommunication network

Previous works:

  • Most advanced models make use of state-of-the-art machine learning classifiers such as random forests [6], [10].
  • Use graph processing techniques [8].
  • Predict customer churn by analyzing the interactions between the customer and the Customer Relationship Management (CRM) data [9]
  • These models base their effectiveness on the feature engineering process.
  • The feature engineering process is usually time consuming and tailored only to specific datasets.
  • Machine learning classifiers work well if there is enough human effort spent in feature engineering.
  • Having the right features for each particular problem is usually the most important thing.
  • Features obtained in this human feature engineering process are usually over-specified and incomplete.

Key ideas:

  • Introduce a data representation architecture that allows efficient learning across multiple layers of detailed user behavior representations. This data representation enables the model to scale to full-sized high dimensional customer data, like the social graph of a customer.
  • The first work reporting the use of deep learning for predicting churn in a mobile telecommunication network.
  • Churn in prepaid services is actually measured based on the lack of activity.
  • Infer when this lack of activity may happen in the future for each active customer.

Network architecture: A four-layer feedforward architecture.

Learning algorithms: Autoencoders, deep belief networks and multi-layer feedforward networks with different configurations.

Dataset: Billions of call records from an enterprise business intelligence system. This is a large-scale historical data from a telecommunication company with ≈1.2 million customers and span over sixteen months.


  • Churn rate is very high and all customers are prepaid users, so there is no specific date about contract termination and this action must be inferred in advance from similar behaviors.
  • There are complex underlying interactions amongst the users.


  • Churn prediction is viewed as a supervised classification problem where the behavior of previously known churners and non-churners are used to train a binary classifier.
  • During the prediction phase new users are introduced in the model and the likelihood of becoming a churner is obtained.
  • Depending on the balance replenishment events, each customer can be in one of the following states: (i) new, (ii) active, (iii) inactive or (iv) churn.
  • Customer churn is always preceded by an inactive state; since the goal is to predict churn, the future inactive state is used as a proxy. In particular, t = 30 days without a balance replenishment event is set as the threshold for changing a customer's state from active to inactive.
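The t = 30 days labeling rule above can be sketched as follows. This is a hedged illustration: the function and data names are hypothetical, and only the new/active/inactive distinction is modeled (churn itself is inferred from persistent inactivity).

```python
from datetime import date, timedelta

T = timedelta(days=30)   # inactivity threshold from the paper

def state_on(day, replenishments):
    """State of a prepaid customer on `day`, given their top-up dates."""
    prior = [d for d in replenishments if d <= day]
    if not prior:
        return "new"                       # no balance replenishment yet
    return "active" if day - max(prior) <= T else "inactive"

# Toy replenishment histories for two illustrative customers.
events = {
    "u1": [date(2014, 1, 5), date(2014, 2, 1)],
    "u2": [date(2014, 1, 2)],
}
print(state_on(date(2014, 2, 20), events["u1"]))  # active
print(state_on(date(2014, 2, 20), events["u2"]))  # inactive
```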

Input data preparation:

  • Two main sources of information: Call Detail Records (CDR) and balance replenishment records.
  • Each CDR provides a detailed record about each call made by the customer, having (at least) the following information:
    • Id of the cell tower where the call is originated.
    • Phone number of the customer originating the call.
    • Id of the cell tower where the call is finished.
    • Destination number of the call.
    • Time-stamp of the beginning of the call.
    • Call duration (secs).
    • Unique identification number of the phone terminal.
    • Incoming or outgoing call.
  • On the other hand, balance replenishment records have (at least) the following information:
    • Phone number related to the balance replenishment event.
    • Time-stamp of the balance replenishment event.
    • Amount of money the customer spent in the balance replenishment event.
  • Input vector preparation steps:
    • Compute one input vector per user-id for each month.
    • Input vector contains both calls events and balance replenishment history of each customer.
    • The user-user adjacency matrix is extremely large (roughly 1.2M × 1.2M entries) and not really meaningful.
    • Consider only call records for each user's top-3 contacts (the users most often called).
    • Create a 48-dimensional vector X where each position holds the total seconds the user spent on the phone in that 30-minute time interval, summed over the complete month.
    • We will have a 145-dimensional input vector (~ 3 x 48).
    • Add another 48-dimensional vector per user with the total amount of cash spent in each monthly slot.
    • Include 5 features specific to the business.
    • Add the binary class indicating if the user will be active in month M + 1 or not.
    • Finally, we end up with a 199-dimensional input vector X.
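One plausible reading of the 48-slot call vector described in the steps above (30-minute time-of-day bins accumulated over a month) is the following sketch. The binning interpretation and the `call_vector` helper are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def call_vector(calls):
    """calls: list of (hour, minute, duration_secs) for one user over one month."""
    v = np.zeros(48)
    for hour, minute, secs in calls:
        slot = hour * 2 + (1 if minute >= 30 else 0)   # 30-minute bins of the day
        v[slot] += secs                                # accumulate over the month
    return v

calls = [(9, 15, 120), (9, 45, 60), (22, 5, 300)]      # toy call records
v = call_vector(calls)
print(v[18], v[19], v[44])  # 120.0 60.0 300.0
```

Concatenating such vectors for the user's calls, their top-3 contacts, the cash-spent vector, the 5 business features, and the class bit then yields the 199-dimensional input.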

Deep learning model:

  • Use a standard multi-layer feedforward architecture.
  • The function tanh is used for the nonlinear activation.
  • Each layer computes an output vector using the output of the previous layer.
  • The last output layer generates the predictions by applying the softmax function.
  • Use the negative conditional log-likelihood as a loss function whose expected value over pairs (between training example and output value) is minimized.
  • Apply standard stochastic gradient descent (SGD) with the gradient computed via backpropagation.
  • For the initial weights, employ a random initialization drawn from a normal distribution with standard deviation 0.8.
  • Use dropout as a regularization technique to prevent over-fitting while improving generalization.
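A minimal numpy sketch of the forward pass described above — tanh hidden layers, a softmax output, N(0, 0.8) initial weights, negative log-likelihood, and (inverted) dropout. The layer widths are illustrative assumptions: the paper specifies a four-layer architecture and the 199-dimensional input, but not these exact sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [199, 64, 32, 2]                        # 199-dim input, binary output
Ws = [rng.normal(scale=0.8, size=(a, b)) for a, b in zip(sizes, sizes[1:])]

def forward(x, train=False, p_drop=0.5):
    h = x
    for W in Ws[:-1]:
        h = np.tanh(h @ W)                      # tanh nonlinear activation
        if train:                               # inverted dropout regularizer
            h *= (rng.random(h.shape) < 1 - p_drop) / (1 - p_drop)
    z = h @ Ws[-1]
    z -= z.max(axis=-1, keepdims=True)          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(5, 199))                   # a toy minibatch of 5 customers
probs = forward(x)
labels = np.zeros(5, dtype=int)                 # e.g. all "active"
nll = -np.log(probs[np.arange(5), labels]).mean()   # negative log-likelihood loss
print(probs.shape)
```

Training would then follow the SGD/backpropagation recipe above, with dropout enabled (`train=True`) only during the gradient steps.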

Training and validation:

  • Deep neural network is trained to distinguish between active and inactive customers based on learned features associated with them.
  • In the training phase each instance contains input data for each customer together with the known state in the following month.
  • In the validation phase data from the next month is introduced into the model and the prediction errors are computed.
  • Each instance in the training and validation data may refer to different customers; hence, customer identity is not taken into account in the predictions.
  • The model has been evaluated over 12 months of real customer data (from March 2013 to February 2014).
  • 12 models are trained, and each model generates predictions for all the months.

Results: On average, the model achieves 77.9% AUC on validation data, significantly better than the authors' prior best performance of 73.2%, obtained with random forests and extensive custom feature engineering applied to the same datasets.


  • Multi-layer feedforward models are effective for predicting churn and capture the complex dependencies in the data.
  • Experiments show that the model is quite stable across different months; it thus generalizes well to future instances and does not overfit the training data.
  • Its success on churn prediction can be potentially applied to other business intelligence prediction tasks like fraud detection or upsell.

Future works:

  • Include location data (latitude and longitude) of each call in the input data, hoping to improve the obtained results.
  • Apply deep belief networks, whose unsupervised pre-training may improve the predictive performance [20].
  • The input data architecture may also encode long-term interactions among users for a better model, but the full user-user input data is extremely sparse and becomes very large once long-term user interactions are considered.

My Review:

  • This is a nice and interesting article that highlights the success of deep learning in extracting better features for churn prediction.
  • Although stated in the abstract, the autoencoders and deep belief networks are not yet implemented.
  • The result on the same data is 4.7% higher than the state-of-the-art random forest technique in terms of AUC. However, the computational cost overhead compared to the previous method is unclear.

Review on a Deep Learning that Makes Images Speak Naturally

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews a deep learning work that makes images speak naturally: Deep Visual-Semantic Alignments for Generating Image Descriptions


  • A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene [8].
  • Still, such remarkable ability is an elusive task for our visual recognition models.

Addressed problem:

  • Build a model that generates free-form natural language descriptions of image regions.
  • Strives to take a step towards the goal of generating dense, free-form descriptions of images


  • Flickr8K [15], Flickr30K [49] and MSCOCO [30]
  • These datasets contain 8,000, 31,000 and 123,000 images respectively and each is annotated with 5 sentences using Amazon Mechanical Turk.
  • For Flickr8K and Flickr30K, 1,000 images for validation, 1,000 for testing and the rest for training (consistent with [15, 18]).
  • For MSCOCO, 5,000 images each for validation and testing.

Data Preprocessing:

  • Convert all sentences to lowercase, discard non-alphanumeric characters.
  • Filter words to those that occur at least 5 times in the training set, which results in 2538, 7414, and 8791 words for Flickr8k, Flickr30K, and MSCOCO datasets respectively.

Previous works:

  • Majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [36, 6].
  • Some pioneering approaches that address the challenge of generating image descriptions have been developed [22, 7].
  • The focus of these previous works has been on reducing complex visual scenes into a single sentence, which the authors consider as an unnecessary restriction.

Dense Image Annotation:

  • Barnard et al. [1] and Socher et al. [41] studied the multimodal correspondence between words and images to annotate segments of images.
  • [27, 12, 9] studied the problem of holistic scene understanding, in which the scene type, objects, and their spatial support in the image are inferred. The difference: the focus of these previous works is on correctly labeling scenes, objects, and regions with a fixed set of categories, while the reviewed paper focuses on richer and higher-level descriptions of regions.

Generating textual description:

  • Pose the task as a retrieval problem, where the most compatible annotation in the training set is transferred to a test image [15, 42, 7, 36, 17], or where training annotations are broken up and stitched together [24, 28, 25].
  • Generating image captions based on fixed templates that are filled based on the content of the image [13, 23, 7, 46, 47, 4]. This approach still imposes limits on the variety of outputs, but the advantage is that the final results are more likely to be syntactically correct.
  • Instead of using a fixed template, some approaches that use a generative grammar have also been developed [35, 48].
  • Srivastava et al. [43] use a Deep Boltzmann Machine to learn a joint distribution over images and tags. However, they do not generate extended phrases.
  • Kiros et al. [20] developed a log-bilinear model that can generate full sentence descriptions. However, their model uses a fixed window context, while the proposed Recurrent Neural Network model can condition the probability distribution over the next word in the sentence on all previously generated words.
  • Mao et al. [31] introduced a multimodal Recurrent Neural Network architecture for generating image descriptions on the full image level, but their model is more complex and incorporates the image information in a stream of processing that is separate from the language model.

Grounding natural language in images:

  • Kong et al. [21] develop a Markov Random Field that infers correspondences from parts of sentences to objects to improve visual scene parsing in RGBD images.
  • Matuszek et al. [32] learn joint language and perception model for grounded attribute learning in a robotic setting.
  • Zitnick et al. [51] reason about sentences and their grounding in cartoon scenes.
  • Lin et al. [29] retrieve videos from a sentence description using an intermediate graph representation.
  • The basic form of the proposed model is inspired by Frome et al. [10] who associate words and images through a semantic embedding.
  • Karpathy et al. [18] decompose images and sentences into fragments and infer their inter-modal alignment using a ranking objective. The difference: the previous model is based on grounding dependency tree relations, whereas the proposed model aligns contiguous segments of sentences, which are more meaningful, interpretable, and not fixed in length.

Neural network in visual and language domains:

  • On the image side, Convolutional Neural Networks (CNNs) [26, 22] have recently emerged as a powerful class of models for image classification and object detection [38].
  • On the sentence side, the proposed work takes advantage of pretrained word vectors [34, 16, 2] to obtain low-dimensional representations of words.
  • Recurrent Neural Networks have been previously used in language modeling [33, 44], but this paper additionally conditions these models on images.


  • Design of a model that is rich enough to reason simultaneously about contents of images and their representation in the domain of natural language.
  • The model should be free of assumptions about specific hard-coded templates, rules or categories and instead rely primarily on training data.
  • Datasets of image captions are available in large quantities on the internet, but these descriptions multiplex mentions of several entities whose locations in the images are unknown.

Key ideas:

  • The model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data.
  • The model is based on a novel combination of:
    • Convolutional Neural Networks over image regions
    • Bidirectional Recurrent Neural Networks over sentences,
    • Structured objective that aligns the two modalities through a multimodal embedding.
  • Closed vocabularies of visual concepts constitute a convenient modeling assumption, however, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose.
  • Treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown location in the image.
  • Infer the alignments of word segments and use them to learn a generative model of descriptions.


  • Develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe.
  • Associates the two modalities through a common, multimodal embedding space and a structured objective.
  • Validate the effectiveness of the proposed approach on image-sentence retrieval experiments in which the proposed models surpass the state-of-the-art.
  • Introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text.
  • The generated sentences significantly outperform retrieval based baselines, and produce sensible qualitative predictions.
  • Train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations.
  • Code, data and annotations are publicly available.


  • The input is a set of images and their corresponding sentence descriptions.
  • Present a model that aligns segments of sentences to the visual regions that they describe through a multimodal embedding.
  • Treat these correspondences as training data for our multimodal Recurrent Neural Network model which learns to generate the descriptions.

Representing Images:

  • Sentence descriptions make frequent references to objects and their attributes [23, 18].
  • Girshick et al. [11] detect objects in every image with a Region Convolutional Neural Network (RCNN). The CNN is pre-trained on ImageNet [3] and finetuned on the 200 classes of the ImageNet Detection Challenge [38].
  • Karpathy et al. [18] use the top 19 detected locations plus the whole image and compute the representations based on the pixels inside each bounding box.
  • The CNN transforms the pixels inside each bounding box into the 4096-dimensional activations of the fully connected layer immediately before the classifier.
  • The CNN contains approximately 60 million parameters, and the architecture closely follows the network of Krizhevsky et al. [22].
  • The weight matrix has dimensions h × 4096, where h is the size of the multimodal embedding space (h ranges from 1000 to 1600 in experiments). Every image is thus represented as a set of h-dimensional vectors.

Representing Sentences:

  • Represent the words in the sentence in the same h dimensional embedding space that the image regions occupy.
  • The simplest approach: project every individual word directly into this embedding.
    Shortcomings: does not consider any ordering and word context information in the sentence.
  • Extended approach: use word bigrams, or dependency tree relations as previously proposed [18].
    Shortcomings: still imposes an arbitrary maximum size of the context window and require the use of Dependency Tree Parsers that might be trained on unrelated text corpora.
  • Use a bidirectional recurrent neural network (BRNN) [39] to compute the word representations.
  • The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector.
  • The representation of each word is enriched by a variably-sized context around that word.
  • The weights specify a word embedding matrix that is initialized with 300-dimensional word2vec [34].
  • The BRNN consists of two independent streams of processing, one moving left to right and the other right to left.
  • The final h-dimensional representation for the word is a function of both the word at that location and also its surrounding context in the sentence.
  • Every word representation is a function of all words in the entire sentence, but the empirical finding is that the final word representations align most strongly to the visual concept of the word at that location.
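A hedged numpy sketch of the BRNN word encoder described above: two independent recurrent streams whose states are combined into one h-dimensional vector per word. All sizes are toy values, the gates and biases of the actual model are omitted, and combining the streams by addition is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, N = 8, 5, 4                      # embedding size, word dim, sentence length
Wf, Wb = rng.normal(scale=0.1, size=(2, h, h))   # recurrence weights
Uf, Ub = rng.normal(scale=0.1, size=(2, d, h))   # input projections

def brnn(words):
    """words: (N, d) word vectors -> (N, h) context-enriched representations."""
    fwd, bwd = np.zeros((N, h)), np.zeros((N, h))
    prev = np.zeros(h)
    for t in range(N):                 # left-to-right stream
        prev = np.tanh(words[t] @ Uf + prev @ Wf)
        fwd[t] = prev
    prev = np.zeros(h)
    for t in reversed(range(N)):       # right-to-left stream
        prev = np.tanh(words[t] @ Ub + prev @ Wb)
        bwd[t] = prev
    return fwd + bwd                   # one h-dimensional vector per word

reps = brnn(rng.normal(size=(N, d)))
print(reps.shape)  # (4, 8)
```

Because both streams run over the full sentence, every output vector depends on all words, matching the property noted above.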

Image and Sentence Alignments:

  • Map every image and sentence into a set of vectors in a common h dimensional space.
  • Labels are at the level of entire images and sentences.
  • Formulate an image-sentence score as a function of the individual scores that measure how well a word aligns to a region of an image.
  • Intuitively, a sentence-image pair should have a high matching score if its words have a confident support in the image.
  • Karpathy et al. [18] interpreted the dot product between an image fragment and a sentence fragment as a measure of similarity and used these scores to define the score between the image and the sentence.
  • Every word aligns to the single best image region.
  • The objective function encourages aligned image-sentence pairs to have a higher score than misaligned pairs, by a margin.
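A minimal sketch of this scoring scheme and its margin objective, assuming region and word vectors already live in the common space (the real model obtains them from the CNN and BRNN; the margin value and loss form here are simplified placeholders):

```python
import numpy as np

def image_sentence_score(regions, words):
    """Each word aligns to its single best image region; the
    image-sentence score sums these best-match dot products
    (a simplified form of the alignment score)."""
    # regions: (R, h) array, words: (N, h) array in the common space
    sims = words @ regions.T        # (N, R) word-region dot products
    return sims.max(axis=1).sum()   # best region per word, summed

def margin_ranking_loss(S, margin=1.0):
    """Aligned pairs (the diagonal of the image-sentence score
    matrix S) should beat misaligned pairs by a margin, ranking
    over both images and sentences."""
    diag = np.diag(S)
    cost_img = np.maximum(0.0, margin + S - diag[:, None])   # rank sentences per image
    cost_sent = np.maximum(0.0, margin + S - diag[None, :])  # rank images per sentence
    np.fill_diagonal(cost_img, 0.0)
    np.fill_diagonal(cost_sent, 0.0)
    return cost_img.sum() + cost_sent.sum()
```

When aligned pairs already dominate by more than the margin, the loss is zero, so training focuses on pairs that are still confused.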


Evaluation:

  • Evaluate a compatibility score between all pairs of test images and sentences.
  • Report the median rank of the closest ground-truth result in the list, and Recall@K, which measures the fraction of times a correct item was found among the top K results.
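These two retrieval metrics can be computed as below. A hedged sketch: it assumes a square score matrix in which the correct candidate for query i is item i, which is the usual evaluation setup but not spelled out in this summary.

```python
import numpy as np

def retrieval_metrics(score_matrix, ks=(1, 5, 10)):
    """Given scores between N queries and N candidates, where the
    correct candidate for query i is assumed to be item i, compute
    Recall@K and the median rank of the ground truth."""
    N = score_matrix.shape[0]
    # sort candidates by descending score for each query
    order = np.argsort(-score_matrix, axis=1)
    # 1-based rank of the correct item within each sorted list
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(N)])
    recall = {k: float((ranks <= k).mean()) for k in ks}
    return recall, float(np.median(ranks))
```

Higher Recall@K and lower median rank both indicate better image-sentence retrieval.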


Comparison to Baselines:

  • Compare the proposed full model (“Our model: BRNN”) to the following baselines:
    • DeViSE [10]: a model that learns a score between words and images.
    • Karpathy et al. [18]: averaged the word and image region representations to obtain a single vector for each modality.
    • Socher et al. [42]: trained with a similar objective, but instead of averaging the word representations, merges word vectors into a single sentence vector with a Recursive Neural Network.
    • Kiros et al. [19]: use an LSTM [14] to encode sentences, and report results on Flickr8K and Flickr30K; they outperform the proposed model by using a more powerful CNN (OxfordNet [40]).
  • In all of these cases, the proposed full model (“Our model: BRNN”) provides consistent improvements.


Limitations:

  • The proposed model (RNN) can only generate a description of one input array of pixels at a fixed resolution. A more sensible approach might be to use multiple saccades around the image to identify all entities, their mutual interactions, and the wider context before generating a description.
  • The RNN couples the visual and language domains in the hidden representation only through additive interactions, which are known to be less expressive than more complicated multiplicative interactions [44, 14].
  • Going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.

My Review:

  • The paper is a nice read, since the results are encouraging and interesting.
  • The paper gives a comprehensive summary of state-of-the-art image-to-text research.
  • Unfortunately, the paper does not suggest any strategy or direction for future research.
  • As explained in the limitations, the proposed model is still not as intelligent as a human at describing an image, since it must learn from image-text pairs provided by Mechanical Turk workers. It is more or less like being asked to describe an image using only a set of given texts. The true challenge arises when there are no texts to start with; this seems to require an exponentially more complex system for automatically describing objects and the relations between objects.