Review of Google’s Famous Deep Learning Paper

Posted by Mohamad Ivan Fanany

This post summarizes and reviews Google’s famous deep learning paper, Building high-level features using large scale unsupervised learning.

Addressed problem: Building high-level, class-specific feature detectors from only unlabeled data

Network structure (architecture): A 9-layer locally connected sparse autoencoder with local receptive fields, pooling, and local contrast normalization, trained on a large dataset of images (the model has 1 billion connections)

Dataset: 200×200 pixel images sampled from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. The images are much larger than the typical 32×32 images often used in other deep learning and unsupervised learning work.

Training method: Model parallelism (the parameters of the deep autoencoder are distributed across machines) and asynchronous SGD (stochastic gradient descent).
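
The idea behind asynchronous SGD can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: the function `async_sgd`, the scalar parameter, the quadratic loss, and the round-robin worker order are all hypothetical choices for illustration and say nothing about how the paper's cluster was actually implemented. What it shows is the defining property: each worker computes its gradient on a possibly stale copy of the parameters, and the server applies updates without waiting for the other workers.

```python
def async_sgd(shards, n_rounds=200, learning_rate=0.05):
    """Fit a scalar w that minimizes the mean of (w - x)^2 using stale gradients.

    shards: one list of data points per simulated worker (hypothetical setup).
    """
    server_w = 0.0
    # Every worker starts with a copy of the server parameter.
    local_w = [server_w for _ in shards]
    for _ in range(n_rounds):
        for worker, shard in enumerate(shards):
            # The gradient uses the worker's own copy, which may be stale
            # because other workers have updated the server since the last pull.
            grad = sum(2.0 * (local_w[worker] - x) for x in shard) / len(shard)
            # Push: the server applies the update immediately.
            server_w -= learning_rate * grad
            # Pull: the worker refreshes its copy only after pushing.
            local_w[worker] = server_w
    return server_w


if __name__ == "__main__":
    # Two workers with different data; the overall mean is 2.5.
    print(async_sgd([[1.0, 2.0], [3.0, 4.0]]))
```

Because of the stale copies, the result settles near the overall mean of the data rather than exactly on it, which is the usual trade-off of not synchronizing workers.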

Hardware: A large computer cluster with 1,000 machines (16,000 cores)

Training time: Three days


  • It is possible to train a face detector without having to label images as containing a face or not
  • This feature detector is robust not only to translation but also to scaling and out-of-plane rotation
  • The same network is sensitive to other high-level concepts such as cat faces and human bodies


  • 15.8% accuracy in recognizing 20,000 object categories from ImageNet
  • A leap of 70% relative improvement over the previous state-of-the-art
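
The two figures above can be related with a little arithmetic. The sketch below assumes "relative improvement" means (new − old) / old; the function names are my own, and the baseline it prints is only what the two quoted numbers imply, not a figure taken from this summary.

```python
def relative_improvement(new_accuracy: float, old_accuracy: float) -> float:
    """Relative gain of new_accuracy over old_accuracy, as a fraction."""
    return (new_accuracy - old_accuracy) / old_accuracy


def implied_baseline(new_accuracy: float, relative_gain: float) -> float:
    """Accuracy the previous method must have had for the stated relative gain."""
    return new_accuracy / (1.0 + relative_gain)


if __name__ == "__main__":
    baseline = implied_baseline(15.8, 0.70)
    print(f"implied previous accuracy: {baseline:.1f}%")
    print(f"check: {relative_improvement(15.8, baseline):.2f}")
```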

Inspiration from previous works:

  • The neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as “grand-mother neurons.”
  • The style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is inspired by Neocognitron and HMAX. This style is argued to be an architecture employed by the brain.
  • The parameters in the second layer are learned with an objective that combines reconstruction and sparsity terms, known as reconstruction Topographic Independent Component Analysis (Hyvarinen et al., 2009; Le et al., 2011a). The first term ensures the representations encode important information about the data, i.e., that they can reconstruct the input data. The second term encourages pooling features to group similar features together to achieve invariances.
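
The two terms described in the last bullet can be written down as a small function. This is a minimal sketch that assumes the objective is a squared reconstruction error plus a weighted sum of square roots of pooled squared features, which is the form I understand reconstruction TICA to take. The name `tica_objective`, the weight `lam`, the constant `eps`, and the matrix layouts are illustrative assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def tica_objective(examples, w1, w2, pooling, lam=0.1, eps=1e-6):
    """Reconstruction term plus pooled sparsity term, summed over examples.

    w1, w2:  n_inputs x n_features encoding and decoding weights (assumed layout).
    pooling: n_pools x n_features non-negative pooling weights.
    """
    w1_t = transpose(w1)
    total = 0.0
    for x in examples:
        features = matvec(w1_t, x)             # encode the input
        reconstruction = matvec(w2, features)  # decode back to input space
        # First term: how well the features reconstruct the input.
        recon_term = sum((r - xi) ** 2 for r, xi in zip(reconstruction, x))
        # Second term: square root of pooled squared features, so features
        # that share a pooling unit are encouraged to group together.
        squared = [f * f for f in features]
        pool_term = sum(
            math.sqrt(eps + sum(h * s for h, s in zip(row, squared)))
            for row in pooling
        )
        total += recon_term + lam * pool_term
    return total


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(tica_objective([[3.0, 4.0]], identity, identity, [[1.0, 1.0]]))
```

With identity weights the reconstruction term vanishes, so only the pooling term contributes, which makes the toy case easy to check by hand.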

Current neuroscientific status:

  • The extent of class-specificity of neurons in the brain is an area of active investigation
  • Current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).


  • Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain class-specific feature detectors.
  • The need for large labeled sets poses a significant challenge for problems where labeled data are rare
  • Approaches that make use of inexpensive unlabeled data are often preferred; however, they have not been shown to work well for building high-level features.

Key ideas:

  • Investigates the feasibility of building high-level features from only unlabeled data.
  • Inexpensive way to develop features from unlabeled data.
  • Answers an intriguing question as to whether the specificity of the “grandmother neuron” could possibly be learned from unlabeled data.
  • The parallelism (parameters are distributed across the machines) on the computer cluster uses the idea of local receptive fields (each feature in the autoencoder can connect only to a small region of the lower layer) to reduce communication costs between machines.
  • Invariance to local deformations is achieved by employing local L2 pooling (Hyvarinen et al., 2009; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009).
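
The two operations named in the last bullet can be sketched on a 1-D signal. This is a simplified illustration, not the paper's layers: it assumes L2 pooling means the square root of the sum of squares in each window, and it uses a uniform window for contrast normalization where the cited work may use a weighted one. The function names and the `floor` parameter are my own.

```python
import math


def l2_pool(values, size, stride):
    """Square root of the sum of squares in each window."""
    return [
        math.sqrt(sum(v * v for v in values[i:i + size]))
        for i in range(0, len(values) - size + 1, stride)
    ]


def local_contrast_normalize(values, radius, floor=1.0):
    """Subtract the local mean, then divide by the local standard deviation.

    The divisor is never smaller than `floor`, so flat regions are not amplified.
    """
    n = len(values)
    centered = []
    for i in range(n):
        window = values[max(0, i - radius):i + radius + 1]
        centered.append(values[i] - sum(window) / len(window))
    normalized = []
    for i in range(n):
        window = centered[max(0, i - radius):i + radius + 1]
        std = math.sqrt(sum(c * c for c in window) / len(window))
        normalized.append(centered[i] / max(floor, std))
    return normalized


if __name__ == "__main__":
    # A response that shifts by one position inside a pooling window
    # gives the same pooled output: a small local invariance.
    print(l2_pool([5.0, 0.0, 0.0, 0.0], size=2, stride=2))
    print(l2_pool([0.0, 5.0, 0.0, 0.0], size=2, stride=2))
    print(local_contrast_normalize([2.0, 2.0, 2.0, 2.0], radius=1))
```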

Previous works:

Key difference to previous works:

  • Previous works have only succeeded in learning low-level features such as “edge” or “blob”. The paper goes beyond such simple features and captures complex invariances.
  • In previous works, reducing the time to train the networks (for practical reasons) undermines the learning of high-level features. The paper proposes a way to scale up the dataset, the model, and the computational resources.
  • Although the network also uses local receptive fields, unlike previous works they are not convolutional: the parameters are not shared across different locations in the image.
  • In terms of scale, the network, with 1 billion trainable parameters, is perhaps the largest known network to date. Previous works used only up to 10 million parameters. The human visual cortex, however, is 1 million times larger.
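
The effect of not sharing parameters across locations can be seen by counting weights. In the sketch below, only the 200×200 input size comes from the summary above; the receptive field size, channel count, number of feature maps, and stride are hypothetical values chosen for illustration, and biases are ignored.

```python
def output_size(input_size, field, stride=1):
    """Number of receptive field positions along one image dimension."""
    return (input_size - field) // stride + 1


def conv_params(n_maps, field, in_channels):
    """Convolutional layer: one filter per map, shared across all locations."""
    return n_maps * field * field * in_channels


def locally_connected_params(n_maps, field, in_channels, out_height, out_width):
    """Locally connected layer: a separate filter at every output location."""
    return out_height * out_width * n_maps * field * field * in_channels


if __name__ == "__main__":
    side = output_size(200, field=18)  # field size is a hypothetical choice
    shared = conv_params(n_maps=8, field=18, in_channels=3)
    unshared = locally_connected_params(8, 18, 3, side, side)
    print(f"positions per side: {side}")
    print(f"convolutional weights: {shared:,}")
    print(f"locally connected weights: {unshared:,}")
```

The unshared layer has as many times more weights as there are output locations, which is one way a network with local receptive fields can still reach a very large parameter count.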


  • It is possible to build high-level features from unlabeled data for classification and visualization.
  • The network learns a feature that is highly selective for faces. This is validated by visualization via numerical optimization.
  • The learned detector is invariant to translation, out-of-plane rotation, and scaling.
  • The network also learns the concepts of cat faces and human bodies.

My review:

  • This is an interesting and inspiring paper that pushes the advances of deep learning and unsupervised feature learning to address the very problem that intrigues many scientists: how humans “see”.
  • The experiments seem difficult to replicate because of the high-end resources they require. Can the approach be shown to work in a simpler, smaller network as well? How big or small a network is enough for deep learning?
  • I wonder how the authors of the paper will address the more recent paper, Intriguing properties of neural networks.
