Review on The First Deep Learning that Surpasses Human-Level Performance

This writing summarizes and reviews the first reported paper on ImageNet classification using deep learning that surpasses human-level performance: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

Motivations:

Convolutional neural networks (CNNs) [17, 16] have demonstrated recognition accuracy better than or comparable to humans in several visual recognition tasks, including recognizing traffic signs [3], faces [30, 28], and handwritten digits [3, 31].
The recognition performance of neural networks has improved tremendously, mainly due to advances in two technical directions: building more powerful models and designing effective strategies against overfitting.
Neural networks are becoming more capable of fitting training data due to:
- Increased depth [25, 29]
- Enlarged width [33, 24]
- The use of smaller strides [33, 24, 2, 25]
- New nonlinear activations [21, 20, 34, 19, 27, 9]
- Sophisticated layer designs [29, 11].
Better generalization in neural networks is achieved by
- Effective regularization techniques [12, 26, 9, 31]
- Aggressive data augmentation [16, 13, 25, 29]
- Large-scale data [4, 22].
Among these advances, the rectifier neuron [21, 8, 20, 34], e.g., Rectified Linear Unit (ReLU), is one of several keys to the recent success of deep networks [16].
The rectifier neuron expedites convergence of the training procedure [16] and leads to better solutions [21, 8, 20, 34] than conventional sigmoid-like units.
Despite the prevalence of rectifier networks, recent model improvements [33, 24, 11, 25, 29] and theoretical guidelines for training them [7, 23] have rarely focused on the properties of the rectifiers.

Key ideas:

The paper investigates neural networks from two aspects that are particularly driven by rectifiers.
First, it proposes a new generalization of ReLU, called the Parametric Rectified Linear Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers and improves accuracy at negligible extra computational cost.
Second, it studies the difficulty of training rectified models that are very deep (e.g., 30 weight layers).
It derives a theoretically sound initialization method that explicitly models the nonlinearity of rectifiers (ReLU/PReLU) and helps very deep models converge when trained directly from scratch. This gives more flexibility to explore more powerful network architectures.

Datasets:

The experiments use the 1000-class ImageNet 2012 dataset [22], which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images (with no published labels).
The results are measured by top-1/top-5 error rates [22].
Only the provided data are used for training. All results are evaluated on the validation set, except for the final results, which are evaluated on the test set.
The top-5 error rate is the metric officially used to rank the methods in the classification challenge [22].
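To make the metric concrete, here is a minimal NumPy sketch of a top-k error computation. The function name `top_k_error` and the toy scores are illustrative assumptions, not taken from the paper or the challenge toolkit.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy example with made-up scores: 3 samples, 4 classes.
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.5, 0.1, 0.3, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
labels = np.array([1, 2, 3])
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

In this toy case only the first sample is right at top-1, and the second sample becomes a hit once two guesses are allowed, which is why the top-k error can only fall as k grows.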

Results:

On the 1000-class ImageNet 2012 dataset, the PReLU network (PReLU-net) leads to a single-model result of 5.71% top-5 error, which surpasses all existing multi-model results.
The proposed multi-model result achieves 4.94% top-5 error on the test set, which is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]).
The result surpasses for the first time the reported human-level performance (5.1% in [22]) on this visual recognition challenge.

Parametric rectifiers:

For PReLU, the coefficient of the negative part is not constant and is adaptively learned.
Replacing the parameter-free ReLU (Rectified Linear Unit) activation by a learned parametric activation unit improves classification accuracy. Concurrently, Agostinelli et al. [1] also investigated learning activation functions and showed improvement on other tasks.
PReLU introduces a very small number of extra parameters. The number of extra parameters is equal to the total number of channels, which is negligible when considering the total number of weights. The authors therefore expect no extra risk of overfitting.
The paper also considers a channel-shared variant where the coefficient is shared by all channels of one layer. This variant only introduces a single extra parameter into each layer.
PReLU can be trained using backpropagation [17] and optimized simultaneously with other layers.
The time complexity due to PReLU is negligible for both forward and backward propagation.
The paper adopts the momentum method when updating.
It is worth noting that the paper does not use weight decay (L2 regularization) when updating the PReLU coefficients. Weight decay tends to push the coefficient controlling the slope of the negative part toward zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in the experiments.
The experiments in the paper do not constrain the range of the coefficient controlling the slope of the negative part, so the learned activation function may be non-monotonic.
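The points above can be sketched in NumPy. The activation is f(y) = max(0, y) + a * min(0, y), with one learnable coefficient a per channel. The function names, the array layout (batch, channels, height, width), the starting value 0.25, and the learning rate are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def prelu_forward(y, a):
    """f(y) = max(0, y) + a * min(0, y), one coefficient per channel.

    y: array of shape (batch, channels, height, width)
    a: array of shape (channels,)
    """
    a_b = a.reshape(1, -1, 1, 1)
    return np.maximum(0.0, y) + a_b * np.minimum(0.0, y)

def prelu_backward(y, a, grad_out):
    """Gradients of the loss w.r.t. the input y and the coefficients a."""
    a_b = a.reshape(1, -1, 1, 1)
    # Slope is 1 on the positive side and a on the negative side.
    grad_y = np.where(y > 0, 1.0, a_b) * grad_out
    # Each channel's coefficient collects gradient from all its positions.
    grad_a = (np.minimum(0.0, y) * grad_out).sum(axis=(0, 2, 3))
    return grad_y, grad_a

def momentum_update(a, velocity, grad_a, lr=0.01, mu=0.9):
    """Momentum step on a, deliberately without a weight-decay term."""
    velocity = mu * velocity - lr * grad_a
    return a + velocity, velocity

y = np.array([[[[-2.0, 3.0]]]])   # one sample, one channel, two positions
a = np.array([0.25])              # assumed starting value
out = prelu_forward(y, a)
grad_y, grad_a = prelu_backward(y, a, np.ones_like(y))
a_new, v = momentum_update(a, np.zeros_like(a), grad_a)
```

Setting `a` to zero recovers plain ReLU, which is why a decay term pulling `a` toward zero would bias the unit back toward ReLU. For the channel-shared variant, `grad_a` would additionally be summed over the channel axis.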

Baseline comparisons:

As a baseline, the model is trained with ReLU applied in the convolutional (conv) layers and the first two fully connected (fc) layers. The training implementation follows [10]. The top-1 and top-5 errors are 33.82% and 13.34% on ImageNet 2012, using 10-view testing.
The same architecture is then trained from scratch, with all ReLUs replaced by PReLUs. The top-1 error is reduced to 32.64%. This is an absolute gain of about 1.2% over the ReLU baseline (33.82% vs. 32.64%).
The results also show that the channel-wise and channel-shared PReLUs perform comparably. For the channel-shared version, PReLU introduces only 13 extra free parameters compared with the ReLU counterpart. But this small number of free parameters plays a critical role, as evidenced by the 1.1% gain over the baseline. This implies the importance of adaptively learning the shapes of activation functions.

Initialization of filter weights for rectifiers:

Rectifier networks are easier to train [8, 16, 34] than traditional sigmoid-like activation networks. But a bad initialization can still hamper the learning of a highly non-linear system. In the paper, the authors propose a robust initialization method that removes an obstacle to training extremely deep rectifier networks.
Recent deep CNNs are mostly initialized by random weights drawn from Gaussian distributions [16]. With fixed standard deviations (e.g., 0.01 in [16]), very deep models (e.g., >8 conv layers) have difficulty converging, as reported by the VGG team [25] and also observed in the authors' experiments. To address this issue, the authors of [25] pre-train a model with 8 conv layers to initialize deeper models. But this strategy requires more training time, and may also lead to a poorer local optimum. In [29, 18], auxiliary classifiers are added to intermediate layers to help with convergence.
Glorot and Bengio [7] proposed to adopt a properly scaled uniform distribution for initialization. This is called “Xavier” initialization in [14]. Its derivation is based on the assumption that the activations are linear. This assumption is invalid for ReLU and PReLU.
In the paper, the authors derive a theoretically more sound initialization by taking ReLU/PReLU into account. In their experiments, the proposed initialization method allows extremely deep models (e.g., 30 conv/fc layers) to converge, while the “Xavier” method [7] cannot.
The main difference between the proposed derivation and the “Xavier” initialization [7] is that the proposed derivation addresses the rectifier nonlinearities.
The studies conducted in the paper show that extremely deep, rectified models can now be investigated by using a more principled initialization method. But in their current experiments on ImageNet, the authors have not observed a benefit from training extremely deep models.
Accuracy saturation or degradation was also observed in the study of small models [10], VGG’s large models [25], and in speech recognition [7]. This is perhaps because the method of increasing depth is not appropriate, or the recognition task is not complex enough.
Though the attempts at extremely deep models have not shown benefits, the proposed initialization method lays a foundation for further study on increasing depth.
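The effect of the scaling can be illustrated with a small simulation. The paper's rule draws zero-mean Gaussian weights with standard deviation sqrt(2 / n) for ReLU, or sqrt(2 / ((1 + a^2) n)) for PReLU with initial slope a, where n is the number of inputs to a unit. The toy fully connected network, its width and depth, and the function names below are assumptions chosen only for illustration. For square layers, the forward "Xavier" variance is taken as 1 / n.

```python
import numpy as np

def rectifier_init_std(fan_in, a=0.0):
    """Std for zero-mean Gaussian weights: sqrt(2 / ((1 + a^2) * fan_in)).

    a = 0 corresponds to ReLU; a > 0 is the initial PReLU slope.
    """
    return np.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))

def xavier_like_std(fan_in):
    """Std derived under the linear-activation assumption: sqrt(1 / fan_in)."""
    return np.sqrt(1.0 / fan_in)

def signal_std_after(depth, width, std_fn, rng):
    """Push random data through `depth` ReLU layers and return the output std."""
    x = rng.standard_normal((256, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * std_fn(width)
        x = np.maximum(0.0, x @ w)
    return x.std()

rng = np.random.default_rng(0)
he = signal_std_after(30, 256, rectifier_init_std, rng)
xavier = signal_std_after(30, 256, xavier_like_std, rng)
print(f"30 ReLU layers: rectifier-aware std={he:.3g}, Xavier-like std={xavier:.3g}")
```

With the linear-assumption scaling, each ReLU layer roughly halves the second moment of the signal, so after 30 layers the activations have nearly vanished. The rectifier-aware scaling keeps them at a usable magnitude.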

Network architecture, hardware, and training time:

The baseline architecture in the paper is the 19-layer model (A). For a better comparison, the paper also lists the VGG-19 model [25]. The baseline model A has the following modifications on VGG-19:
1. In the first layer, they use a filter size of 7×7 and a stride of 2;
2. They move the other three conv layers on the two largest feature maps (224, 112) to the smaller feature maps (56, 28, 14). The time complexity is roughly unchanged because the deeper layers have more filters;
3. They use spatial pyramid pooling (SPP) [11] before the first fc layer. The pyramid has 4 levels – the numbers of bins are 7×7, 3×3, 2×2, and 1×1, for a total of 63 bins.
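The bin count in the third modification can be checked with a small sketch of pyramid pooling: 7×7 + 3×3 + 2×2 + 1×1 = 49 + 9 + 4 + 1 = 63 bins per channel. The bin-edge rule below (floor for the start, ceiling for the end, so bins may overlap slightly) is an assumption of this sketch and is not necessarily the exact window rule used in [11].

```python
import numpy as np

def spp_max_pool(feature_map, levels=(7, 3, 2, 1)):
    """Max-pool a (channels, H, W) map into n x n bins per level and concatenate."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n
            r1 = ((i + 1) * h + n - 1) // n      # ceiling division
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n  # ceiling division
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

levels = (7, 3, 2, 1)
n_bins = sum(n * n for n in levels)
fmap = np.arange(2 * 14 * 14, dtype=float).reshape(2, 14, 14)
features = spp_max_pool(fmap, levels)
```

The output length is the number of bins times the number of channels, independent of the spatial size of the feature map. This is what lets the fixed-size fc layers follow a pooled window.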
There is no evidence that the proposed model A is a better architecture than VGG-19, though model A has better results than the VGG-19 result reported by [25].
Model A and a reproduced VGG-19 (with SPP and the authors' initialization) are comparable. The main purpose of using model A is faster running speed. In practice, conv layers on larger feature maps run slower than those on smaller feature maps, even when their time complexity is the same.
In four-GPU implementation, the model A takes 2.6s per mini-batch (128), and the reproduced VGG-19 takes 3.0s, evaluated on four Nvidia K20 GPUs.
The proposed model B is a deeper version of A. It has three extra conv layers. The proposed model C is a wider (with more filters) version of B. The width substantially increases the complexity, and its time complexity is about 2.3× of B. Training A/B on four K20 GPUs, or training C on eight K40 GPUs, takes about 3-4 weeks.
The authors choose to increase the model width instead of depth, because deeper models have only diminishing improvement or even degradation on accuracy.
In recent experiments on small models [10], it has been found that aggressively increasing the depth leads to saturated or degraded accuracy.
In the VGG paper [25], the 16-layer and 19-layer models perform comparably. In the speech recognition research of [7], the deep models degrade when using more than 8 hidden layers (all being fc).
The authors conjecture that similar degradation may also happen on larger models for ImageNet. They monitored the training procedures of some extremely deep models (with 3 to 9 layers added on B in Table 3 of the paper) and found that both training and testing error rates degraded in the first 20 epochs. However, they did not run these models to the end because of a limited time budget, so there is not yet solid evidence that these large and overly deep models will ultimately degrade. Because of the possible degradation, the authors choose not to further increase the depth of these large models.
On the other hand, the recent research [5] on small datasets suggests that the accuracy should improve from the increased number of parameters in conv layers. This number depends on the depth and width. So the authors choose to increase the width of the conv layers to obtain a higher capacity model.
While all of these models are very large, no severe overfitting is observed. The authors attribute this to the aggressive data augmentation used throughout the whole training procedure.

Training:

The training algorithm mostly follows [16, 13, 2, 11, 25]. From a resized image whose shorter side is s, a 224×224 crop is randomly sampled, with the per-pixel mean subtracted. The scale s is randomly jittered in the range of [256, 512], following [25]. One half of the random samples are flipped horizontally [16]. Random color altering [16] is also used.
Unlike [25], which applies scale jittering only during fine-tuning, the authors apply it from the beginning of training. Further, unlike [25], which initializes a deeper model using a shallower one, the authors directly train the very deep model using their initialization. Their end-to-end training may help improve accuracy, because it may avoid poorer local optima.
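The sampling procedure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, resizing uses nearest-neighbour indexing only to avoid extra dependencies, and the per-pixel mean subtraction and color altering steps are omitted.

```python
import numpy as np

def sample_training_crop(image, rng, crop=224, s_min=256, s_max=512):
    """Scale-jitter, random-crop and randomly flip one (H, W, 3) image."""
    h, w, _ = image.shape
    s = int(rng.integers(s_min, s_max + 1))   # jittered length of the shorter side
    scale = s / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbour resize (a real pipeline would interpolate properly).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    # Random crop position inside the resized image.
    top = int(rng.integers(0, new_h - crop + 1))
    left = int(rng.integers(0, new_w - crop + 1))
    patch = resized[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                    # flip half of the samples
        patch = patch[:, ::-1]
    return patch

rng = np.random.default_rng(0)
image = rng.random((300, 400, 3))             # stand-in for a training image
patch = sample_training_crop(image, rng)
```

Because s is drawn anew for every sample, the same object is seen at many apparent sizes during training.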
Other hyper-parameters that might be important are as follows.
- The weight decay is 0.0005, and momentum is 0.9.
- Dropout (50%) is used in the first two fc layers.
- The minibatch size is fixed as 128. The learning rate is 1e-2, 1e-3.

Testing:

The paper adopts the strategy of “multi-view testing on feature maps” used in the SPP-net paper [11]. This strategy is further improved using the dense sliding window method in [24,25].
The authors first apply the convolutional layers on the resized full image and obtain the last convolutional feature map. In the feature map, each 14×14 window is pooled using the SPP layer [11].
The fc layers are then applied on the pooled features to compute the scores. This is also done on the horizontally flipped images. The scores of all dense sliding windows are averaged [24,25]. They further combine the results at multiple scales as in [11].

Multi-GPU Implementations:

The paper adopts a simple variant of Krizhevsky’s method [15] for parallel training on multiple GPUs.
The paper adopts “data parallelism” [15] on the conv layers.
The GPUs are synchronized before the first fc layer. Then the forward/backward propagations of the fc layers are performed on a single GPU – this means that they do not parallelize the computation of the fc layers. The time cost of the fc layers is low, so it is not necessary to parallelize them. This leads to a simpler implementation than the “model parallelism” in [15].
Besides, model parallelism introduces some overhead due to the communication of filter responses, and is not faster than computing the fc layers on just a single GPU.
The authors implement the above algorithm on their modification of the Caffe library [14]. They do not increase the mini-batch size (128) because the accuracy may be decreased [15]. For the large models in the paper, they observed a 3.8x speedup using 4 GPUs, and a 6.0x speedup using 8 GPUs.

Comparisons with human performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes.
When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.
The classification result (4.94%), reported in the paper, exceeds the reported human-level performance. Up to now, the result is the first published instance of surpassing humans on this visual recognition challenge.
The analysis in [22] reveals that the two major types of human errors come from fine-grained recognition and class unawareness. The investigation in [22] suggests that algorithms can do a better job on fine-grained recognition (e.g., 120 species of dogs in the dataset).
While humans can easily recognize these objects as a bird, a dog, and a flower, it is nontrivial for most humans to tell their species. On the negative side, the algorithm still makes mistakes in cases that are not difficult for humans, especially for those requiring context understanding or high-level knowledge (e.g., the “spotlight” images).
While the algorithm produces a superior result on this particular dataset, the authors admit this does not indicate that machine vision outperforms human vision on object recognition in general.
On recognizing elementary object categories (i.e., common objects or concepts in daily lives) such as the Pascal VOC task [6], machines still have obvious errors in cases that are trivial for humans. Nevertheless, the results show the tremendous potential of machine algorithms to match human-level performance on visual recognition.

My Review:

One interesting aspect of this paper is the reported classification performance of a deep learning algorithm that surpasses human-level performance (though this claim should be interpreted with caution).
Yet it is not easy to determine what actually drives this impressive performance: the use of PReLU, the wider and deeper structure, the better initialization, or the better design? It would be nice if the authors had separated the improvement contributed by each of these factors step by step, as a kind of ablation study.
While the proposed initialization for ReLU/PReLU networks allows a very deep structure to converge, whereas a structure using “Xavier” initialization cannot, the authors also state that deeper models bring only diminishing improvement or even degradation in accuracy.
It seems we still have no sound theoretical explanation of how PReLU propagates the discriminative capability all the way from the input to the output.

March 25, 2015 | March 26, 2015 | fananymi

Review on The First Paper on Rectified Linear Units (The Building Block for Current State-of-the-art Deep Convolutional NN)

Posted by Mohamad Ivan Fanany

This writing summarizes and reviews the first paper on rectified linear units (the building block for current state-of-the-art implementations of deep convolutional neural networks): Rectified Linear Units Improve Restricted Boltzmann Machines.

Motivations:

Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data including labeled or unlabeled images (Hinton et al., 2006), sequences of mel-cepstral coefficients that represent speech (Mohamed & Hinton, 2010), bags of words that represent documents (Salakhutdinov & Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007).
In their conditional form they can be used to model high-dimensional temporal sequences such as video or motion capture data (Taylor et al., 2006).
Their most important use is as learning modules that are composed to form deep belief nets (Hinton et al., 2006).

Key ideas:

Restricted Boltzmann machines were developed using binary stochastic hidden units.
These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases.
The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units.
Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.

Learning an RBM:

Images composed of binary pixels can be modeled by an RBM that uses a layer of binary hidden units (feature detectors) to model the higher-order correlations between pixels.
If there are no direct interactions between the hidden units and no direct interactions between the visible units that represent the pixels, there is a simple and efficient method called “Contrastive Divergence” to learn a good set of feature detectors from a set of training images (Hinton, 2002).
Start with small, random weights on the symmetric connections between each pixel and each feature detector.
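Then repeatedly update each weight, using the difference between two measured pairwise correlations between visible units and hidden units: one measured when the visible units are driven by the training data, and one measured on the reconstructions.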
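A minimal NumPy sketch of one such update (one-step Contrastive Divergence, CD-1) for a binary-binary RBM is given below. The function name, learning rate, and batch layout are assumptions. Using probabilities rather than samples in the reconstruction phase is a common simplification, not necessarily the exact procedure of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 step on a batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the visibles, then recompute the hiddens.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Difference of the two pairwise correlations: <v h>_data - <v h>_recon.
    n = v0.shape[0]
    dW = (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    W = W + lr * dW
    b_vis = b_vis + lr * (v0 - p_v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

rng = np.random.default_rng(0)
v0 = (rng.random((8, 6)) < 0.5).astype(float)   # toy batch of binary "images"
W = 0.01 * rng.standard_normal((6, 4))          # small random initial weights
W_new, b_vis, b_hid = cd1_update(v0, W, np.zeros(6), np.zeros(4), rng)
```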

Gaussian Units:

RBMs were originally developed using binary stochastic units for both the visible and hidden layers (Hinton, 2002).
To deal with real-valued data such as the pixel intensities in natural images, Hinton & Salakhutdinov (2006) replaced the binary visible units by linear units with independent Gaussian noise, as first suggested by Freund & Haussler (1994).
It is possible to learn the variance of the noise for each visible unit but this is difficult using binary hidden units.
In many applications, it is much easier to first normalise each component of the data to have zero mean and unit variance and then to use noise-free reconstructions, with the variance set to 1.
The reconstructed value of a Gaussian visible unit is then equal to its top-down input from the binary hidden units plus its bias.
The paper uses this type of noise-free visible unit for the models of object and face images described later.

Rectified Linear Units:

To allow each unit to express more information, Teh & Hinton (2001) introduced binomial units, which can be viewed as separate copies of a binary unit that all share the same bias and weights.
A nice side effect of using weight-sharing to synthesize a new type of unit out of binary units is that the mathematics underlying learning in binary-binary RBMs remains unchanged.
Since all copies receive the same total input, they all have the same probability of turning on, and this only has to be computed once.
For small probability, this acts like a Poisson unit, but as the probability approaches 1 the variance becomes small again, which may not be desirable. Also, for small values of probability the growth in probability is exponential in the total input.
A small modification to binomial units makes them far more interesting as models of real neurons and also more useful for practical applications.
The authors make an infinite number of copies that all have the same learned weight vector and the same learned bias, but each copy has a different, fixed offset to the bias. If the offsets are −0.5, −1.5, −2.5, …, the sum of the probabilities of the copies is extremely close to a closed form, log(1 + e^x), where x is the total input.
The total activity of all of the copies behaves like a noisy, integer-valued version of a smoothed rectified linear unit.
A drawback of giving each copy a bias that differs by a fixed offset is that the logistic sigmoid function needs to be used many times to get the probabilities required for sampling an integer value correctly.
It is possible, however, to use a fast approximation in which the sampled value of the rectified linear unit is not constrained to be an integer.
The authors call a unit that uses this approximation a Noisy Rectified Linear Unit (NReLU).
The paper shows that NReLUs work better than binary hidden units for several different tasks.
Jarrett et al. (2009) have explored various rectified nonlinearities in the context of convolutional networks and have found them to improve discriminative performance.
The empirical results in this paper further support this observation.
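The approximation chain described in this section can be checked numerically. The sketch below compares the sum of shifted sigmoids with the smooth closed form log(1 + e^x), then draws NReLU samples as max(0, x + N(0, sigmoid(x))), where the second argument of N is the variance. The function names and the truncation to 100 copies are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_sum(x, n_copies=100):
    """Sum of sigmoid(x - i + 0.5) for i = 1..n_copies (offsets -0.5, -1.5, ...)."""
    offsets = np.arange(1, n_copies + 1) - 0.5
    return sigmoid(x - offsets).sum()

def softplus(x):
    """log(1 + e^x), the smooth rectifier that the sum approximates."""
    return np.log1p(np.exp(x))

def nrelu_sample(x, rng):
    """Noisy rectified linear unit: max(0, x + Gaussian noise of variance sigmoid(x))."""
    noise = rng.standard_normal(np.shape(x)) * np.sqrt(sigmoid(x))
    return np.maximum(0.0, x + noise)

gaps = [abs(stepped_sigmoid_sum(x) - softplus(x)) for x in (-5.0, 0.0, 5.0)]
rng = np.random.default_rng(0)
samples = nrelu_sample(np.array([-3.0, 0.0, 3.0]), rng)
```

The single Gaussian draw replaces many evaluations of the logistic function, which is the speed advantage noted above. The price is that the sampled value is no longer an integer.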

Intensity Equivariance:

NReLUs have some interesting mathematical properties (Hahnloser et al., 2003), one of which is very useful for object recognition.
A major consideration when designing an object recognition system is how to make the output invariant to properties of the input such as location, scale, orientation, lighting etc.
Convolutional neural networks are often said to achieve translation invariance but in their pure form they actually achieve something quite different.
If an object is translated in the input image, its representation in a pool of local filters that have shared weights is also translated. So if it can be represented well by a pattern of feature activities when it is in one location, it can also be represented equally well by a translated pattern of feature activities when it is in another location.
The authors call this translation equivariance: the representation varies in the same way as the image.
In a deep convolutional net, translation invariance is achieved by using subsampling to introduce a small amount of translation invariance after each layer of filters.
Binary hidden units do not exhibit intensity equivariance, but rectified linear units do, provided they have zero biases and are noise-free.
Scaling up all of the intensities in an image cannot change whether a zero-bias unit receives a total input above or below zero.
So all of the “off” units remain off and the remainder all increase their activities by a factor.
This stays true for many layers of rectified linear units. When deciding whether two face images come from the same person, the authors make use of this nice property of rectified linear units by basing the decision on the cosine of the angle between the activities of the feature detectors in the last hidden layer.
The feature vectors are intensity equivariant and the cosine is intensity invariant.
The type of intensity invariance that is important for recognition cannot be achieved by simply dividing all the pixel intensities by their sum. This would cause a big change in the activities of feature detectors that attend to the parts of a face when there is a bright spot in the background.
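The equivariance argument can be verified on a toy network. The sketch below uses random weights, zero biases, and no noise. The layer sizes and names are arbitrary assumptions, and the "images" are just random non-negative vectors.

```python
import numpy as np

def relu_features(x, weights):
    """Stack of zero-bias, noise-free rectified linear layers."""
    for w in weights:
        x = np.maximum(0.0, x @ w)
    return x

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 64)), rng.standard_normal((64, 64))]
img_a, img_b = rng.random(16), rng.random(16)

f_a = relu_features(img_a, weights)
f_a_bright = relu_features(3.0 * img_a, weights)   # same image, 3x the intensity
f_b = relu_features(img_b, weights)

sim = cosine(f_a, f_b)
sim_bright = cosine(f_a_bright, f_b)
```

Scaling the input by 3 scales every feature by 3 (equivariance), so the cosine between feature vectors does not change (invariance). Adding a non-zero bias to any layer would break this exact relationship.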

Experiments:

The paper empirically compares NReLUs to stochastic binary hidden units on two vision tasks:
1. Object recognition on the Jittered-Cluttered NORB dataset (LeCun et al., 2004),
2. Face verification on the Labeled Faces in the Wild dataset (Huang et al., 2007).
Both datasets contain complicated image variability that makes them difficult tasks. Also, they both already have a number of published results for various methods, which gives a convenient basis for judging how good the results are.
The paper uses RBMs with binary hidden units or NReLUs to generatively pre-train one or more layers of features and then discriminatively fine-tune the features using backpropagation.
On both tasks NReLUs give better discriminative performance than binary units.

Results on Jittered-Cluttered NORB dataset:

NORB is a synthetic 3D object recognition dataset that contains five classes of toys (humans, animals, cars, planes, trucks) imaged by a stereo-pair camera system from different viewpoints under different lighting conditions.
NORB comes in several versions. The Jittered-Cluttered version has grayscale stereo-pair images with cluttered background and a central object which is randomly jittered in position, size, pixel intensity etc. There is also a distractor object placed in the periphery.
For each class, there are ten different instances, five of which are in the training set and the rest in the test set. So at test time a classifier needs to recognize unseen instances of the same classes. In addition to the five object classes, there is a sixth class whose images contain none of the objects in the centre.
NReLUs outperform binary units, both when randomly initialized and when pre-trained (the error rate is lower by 5.2% without pre-training and by 2.2% with pre-training, in absolute terms).
Pre-training helps improve the performance of both unit types. But NReLUs without pre-training are better than binary units with pre-training.
The paper also reports results for classifiers with two hidden layers. Just as for single hidden layer classifiers, NReLUs outperform binary units regardless of whether greedy pre-training is used only in the first layer, in both layers, or not at all.
Pre-training improves the results: pre-training only the first layer and randomly initializing the second layer is better than randomly initializing both.
Pre-training both layers gives further improvement for NReLUs but not for binary units.
For comparison, the error rates of some other models are: multinomial regression on pixels 49.9%, Gaussian kernel SVM 43.3%, convolutional net 7.2%, convolutional net with an SVM at the top-most hidden layer 5.9%.
The last three results are from (Bengio & LeCun, 2007). The results of the proposed models are worse than those of convolutional nets, but two differences should be kept in mind:
1. The proposed models use heavily subsampled images,
2. Convolutional nets have knowledge of image topology and approximate translation invariance hard-wired into their architecture.

Results on Labeled Faces in the Wild dataset:

The prediction task for the Labeled Faces in the Wild (LFW) dataset is as follows: given two face images as input, predict whether the identities of the faces are the same or different.
The dataset contains colour faces of public figures collected from the web using a frontal-face detector. The bounding box computed by the face detector is used to approximately normalize the face’s position and scale within the image.
Models using NReLUs seem to be more accurate, but the standard deviations are too large to draw firm conclusions.

Summary:

The paper showed how to create a more powerful type of hidden unit for an RBM by tying the weights and biases of an infinite set of binary units.
The paper then approximated these stepped sigmoid units with noisy rectified linear units and showed that they work better than binary hidden units for recognizing objects and comparing faces.
The paper also showed that they can deal with large intensity variations much more naturally than binary units.
Finally the paper showed that they implement mixtures of undirected linear models (Marks & Movellan, 2001) with a huge number of components using a modest number of parameters.

My reviews:

I read this paper while trying to find out what the Rectified Linear Unit (ReLU) is, since applying it to a new convolutional neural network delivered classification performance that surpasses human performance on an image recognition task.
I wanted to know the meaning of the word “rectified” in the term. From reading this paper, as well as pages on Wikipedia and Quora and an article by Alexandre Dalyac, it seems that “rectify” refers to rectifying the positive saturated part of the well-known sigmoid function, so that the output range of the activation function is no longer [0, 1] but [0, infinity). Dalyac stated that the ReLU is the building block for current state-of-the-art implementations of deep convolutional neural networks.
What impressed me when reading the paper is the comparison of its results (~15.2% error) with those of Gaussian kernel SVMs (43.3% error).
The paper gives many references related to ReLU.
The paper presents experimental results that show better classification performance than other methods (except the CNN).
The paper also discusses why better results can be achieved: the mathematical properties of the units, one of which (intensity equivariance) is very useful for object recognition.

March 25, 2015 | March 26, 2015 | fananymi

Review on The Paper which Reveals The Power of Convolutional Neural Net

Posted by Mohamad Ivan Fanany

This writing summarizes and reviews a paper that tries to confirm and understand why large convolutional networks demonstrate impressive classification performance: Visualizing and Understanding Convolutional Networks.

Motivations:

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark [18].
Despite this encouraging progress, there is no clear understanding of why they perform so well or how they might be improved. There is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without a clear understanding of how and why they work, the development of better models is reduced to trial and error.

Addressed problem:

Explore why large Convolutional Networks perform so well, and how they might be improved.

Novelty:

Introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The new visualization technique:
- Reveals the input stimuli that excite individual feature maps at any layer in the model.
- Allows us to observe the evolution of features during training and to diagnose potential problems with the model.
- Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
Perform an ablation study to discover the performance contribution from different model layers.
The proposed ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification

Previous works:

Since their introduction by LeCun et al. [20] in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection.
In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks:
- Ciresan et al. [4] demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets.
- Most notably, Krizhevsky et al. [18] show record-beating performance on the ImageNet 2012 classification benchmark, achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%.
- Girshick et al. [10] have shown leading detection performance on the PASCAL VOC dataset.
Several factors are responsible for this dramatic improvement in performance:
1. The availability of much larger training sets, with millions of labeled examples;
2. Powerful GPU implementations, making the training of very large models practical;
3. Better model regularization strategies, such as Dropout.

Key Ideas:

Visualization:
- Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used.
- The proposed visualization technique uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space.
- A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it maps features back to pixels. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
- [8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation.
- The proposed approach in the paper is similar to contemporary work by Simonyan et al. [23] who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features that are used.
- Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. The proposed visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
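To make the "pooling in reverse" idea above concrete, here is a minimal NumPy sketch. It assumes, as I recall from the paper, that the forward max pooling records the location of each maximum (the "switches") and that unpooling places each value back at that location. The function names and the toy feature map are my own illustration, not the authors' code.

```python
import numpy as np

def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was."""
    ph, pw = fmap.shape[0] // size, fmap.shape[1] // size
    pooled = np.zeros((ph, pw), dtype=fmap.dtype)
    switches = np.zeros((ph, pw), dtype=np.int64)  # flat index inside each window
    for i in range(ph):
        for j in range(pw):
            window = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = int(np.argmax(window))
            pooled[i, j] = window.flat[switches[i, j]]
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Approximate inverse: put each pooled value back at its recorded location."""
    ph, pw = pooled.shape
    out = np.zeros((ph * size, pw * size), dtype=pooled.dtype)
    for i in range(ph):
        for j in range(pw):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out

# Hypothetical 4x4 feature map
fmap = np.array([[1., 5., 2., 0.],
                 [3., 4., 8., 1.],
                 [0., 2., 1., 1.],
                 [9., 6., 7., 3.]])
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches)
```

Everything except the four maxima is zero in `restored`, which is why the projections only show the structure that actually excited the feature map.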
Generalization:
- The demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10].

Analysis steps:

Start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet.
Explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1, 26].

Network architecture:

Standard fully supervised convnet models are used throughout the paper, as defined by LeCun et al. [20] and Krizhevsky et al. [18].
Map a color 2D input image, via a series of layers, to a probability vector over different classes.
Each layer consists of:
1. Convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;
2. Passing the responses through a rectified linear function;
3. Optionally, max pooling over local neighborhoods; and
4. Optionally, a local contrast operation that normalizes the responses across feature maps.
For more details of these operations, see [18] and [16].
The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier.
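The layer recipe above (convolution, rectification, pooling, then a softmax on top) can be sketched in a few lines of NumPy. This is only a single-channel toy under my own assumptions: the 10×10 input, the 3×3 filter, the 5 classes, and the random weights are all hypothetical, and the contrast normalization step is left out.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' convolution (cross-correlation form)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size neighborhoods."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.standard_normal((10, 10))    # stand-in for an input image
kernel = rng.standard_normal((3, 3))     # stand-in for one learned filter
features = max_pool(relu(conv2d_valid(image, kernel)))  # 10x10 -> 8x8 -> 4x4
W = rng.standard_normal((5, features.size))             # toy classifier weights
probs = softmax(W @ features.ravel())                   # probability vector over 5 classes
```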

Datasets: ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6].

Data preparation:

Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256×256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224×224 (corners + center with(out) horizontal flips).
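The cropping scheme above can be sketched as follows. I assume the image has already been resized so that its smaller side is 256 (the resize itself is skipped), and the mean image here is a made-up constant rather than the real per-pixel mean.

```python
import numpy as np

def center_crop(img, size):
    """Crop the central size x size region of an H x W x C image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def ten_crop(img, crop=224):
    """Four corners + center, each with and without a horizontal flip."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [img[t:t + crop, l:l + crop] for t, l in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips
    return np.stack(crops)

rng = np.random.default_rng(0)
resized = rng.random((256, 320, 3))       # stand-in: smaller side is already 256
mean_image = np.full((256, 256, 3), 0.5)  # hypothetical per-pixel mean
square = center_crop(resized, 256) - mean_image
batch = ten_crop(square)                  # 10 sub-crops of 224x224
```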

Learning algorithm:

Uses a large set of labeled images, where label is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare the output and the target.
The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent.
Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 0.01, in conjunction with a momentum term of 0.9.
Anneal the learning rate throughout training manually when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 0.01 and biases are set to 0.
Stopped training after 70 epochs.
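As a rough illustration of the training recipe above, the sketch below runs mini-batch SGD with momentum on a toy softmax classifier using the same hyperparameters (batch size 128, learning rate 0.01, momentum 0.9, weights set to 0.01, biases to 0). The synthetic data, the linear model, and the classical momentum form are my assumptions; the paper trains a deep convnet with dropout and manual learning rate annealing, which are not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
n, d, k = 1024, 20, 3                          # toy sizes, not from the paper
X = rng.standard_normal((n, d))
y = np.argmax(X @ rng.standard_normal((d, k)), axis=1)

W = np.full((d, k), 0.01)                      # weights initialized to 0.01
b = np.zeros(k)                                # biases initialized to 0
vW, vb = np.zeros_like(W), np.zeros_like(b)
lr, momentum, batch_size = 0.01, 0.9, 128

initial_loss = cross_entropy(softmax(X @ W + b), y)
for epoch in range(20):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        grad_logits = softmax(X[idx] @ W + b)
        grad_logits[np.arange(len(idx)), y[idx]] -= 1.0   # d(loss)/d(logits)
        grad_W = X[idx].T @ grad_logits / len(idx)
        grad_b = grad_logits.mean(axis=0)
        vW = momentum * vW - lr * grad_W                  # momentum update
        vb = momentum * vb - lr * grad_b
        W += vW
        b += vb
final_loss = cross_entropy(softmax(X @ W + b), y)
```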

Training time: 12 days on a single GTX580 GPU, using an implementation based on [18]

Insights:

The paper explored large convolutional neural network models, trained for image classification, in a number of ways.
The paper presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers.
The paper also shows how these visualizations can be used to identify problems with the model and so obtain better results, for example improving on Krizhevsky et al.'s impressive ImageNet 2012 result.
The paper demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context.
An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.
The ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that the model can beat the best reported results, in the latter case by a significant margin.
The proposed convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias [25], although it was still within 3.2% of the best reported result, despite no tuning for the task.
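The occlusion experiments mentioned in the insights above can be sketched like this: slide a grey patch over the image and record the classifier's confidence in the true class at each position. The `toy_predict_prob` function is a made-up stand-in for a trained convnet, and the patch size, stride, and fill value are my assumptions.

```python
import numpy as np

def occlusion_map(image, predict_prob, patch=4, stride=4, fill=0.5):
    """Confidence in the true class as a grey patch slides over the image."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            occluded[i * stride:i * stride + patch,
                     j * stride:j * stride + patch] = fill
            heat[i, j] = predict_prob(occluded)
    return heat

def toy_predict_prob(img):
    """Hypothetical classifier: confidence depends only on one bright region."""
    return float(img[4:8, 8:12].mean())

image = np.zeros((16, 16))
image[4:8, 8:12] = 1.0                     # the "object"
heat = occlusion_map(image, toy_predict_prob)
most_sensitive = np.unravel_index(np.argmin(heat), heat.shape)
```

The confidence drops only when the patch covers the object, which is the behavior the paper reports for the real model: it relies on local structure rather than broad scene context.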

My notes and review:

This paper is interesting because it visualizes the features learned inside deep convolutional neural networks via a deconvolutional network. Such visualization can hardly be found in other CNN papers. It brings more understanding of why convolutional neural networks perform so well on visual recognition tasks.
Looking at the visualization of features in a fully trained model, which shows the top activations in a random subset of feature maps across the validation data, projected down to pixel space using the deconvolutional network, we can confirm that the features visually reflect the structure of the input.
I am still curious whether our brain performs convolution and deconvolution. The answer might be found by looking back into the work of Kunihiko Fukushima on the neocognitron, which is inspired by the model proposed by Hubel & Wiesel in 1959.

March 25, 2015March 26, 2015 fananymi

Review on The Most Intriguing Paper on Deep Learning

Posted by Mohamad Ivan Fanany


This writing summarizes and reviews the most intriguing paper on deep learning: Intriguing properties of neural networks.

Motivations:

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks.
Their expressiveness is the reason they succeed but also causes them to learn uninterpretable solutions that could have counter-intuitive properties.

Addressed problem:

Report two counter-intuitive properties of deep learning neural networks.
The first property is concerned with the semantic meaning of individual units.
The second property is concerned with the stability of neural networks with respect to small perturbations to their inputs.

Dataset: MNIST, ImageNet (AlexNet), 10M images sampled from Youtube (QuocNet).

Previous works:

Previous works analyzed the semantic meaning of various units by finding the set of inputs that maximally activate a given unit.
The inspection of individual units makes the implicit assumption that the units of the last feature layer form a distinguished basis which is particularly useful for extracting semantic information.
Previous works assume that a state-of-the-art deep neural network that generalizes well on an object recognition task can be expected to be robust to small perturbations of its input, because a small perturbation cannot change the object category of an image.
Traditional computer vision systems rely on feature extraction: often a single feature is easily interpretable.
Previous works also interpret an activation of a hidden unit as a meaningful feature. They look for input images which maximize the activation value of this single feature [6, 13, 7, 4].

Inspiration from previous works:

Hard-negative mining, in computer vision, consists of identifying training set examples (or portions thereof) which are given low probabilities by the model, but which should be high probability instead [5].
A variety of recent state of the art computer vision models employ input deformations during training for increasing the robustness and convergence speed of the models [9, 13].

Key Ideas:

It is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information.
By applying an imperceptible non-random perturbation to a test image, it is possible to arbitrarily change the network’s prediction.
These perturbations are found by optimizing the input to maximize the prediction error. The perturbed examples are termed as “adversarial examples”.
If we use one neural network to generate a set of adversarial examples, we find that these examples are still statistically hard for another neural network even when it was trained with different hyperparameters or, most surprisingly, when it was trained on a different set of examples.
The paper proposes a scheme to make input deformation process adaptive in a way that exploits the model and its deficiencies in modeling the local space around the training data.
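To illustrate how a perturbation can be found by optimizing the input to increase the prediction error, here is a deliberately simplified sketch. It uses plain gradient ascent on the loss of a toy logistic-regression classifier, which is my substitution: the paper itself, as far as I recall, uses a box-constrained L-BFGS procedure on deep networks. The weights, the input, and the step size are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, label, w, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    return (sigmoid(w @ x + b) - label) * w

def find_adversarial(x, label, w, b, step=0.01, max_iters=1000):
    """Nudge x uphill on the loss until the predicted class flips."""
    x_adv = x.copy()
    for _ in range(max_iters):
        if (sigmoid(w @ x_adv + b) > 0.5) != bool(label):
            break                                   # prediction has flipped
        grad = input_gradient(x_adv, label, w, b)
        x_adv = np.clip(x_adv + step * grad, 0.0, 1.0)  # stay in valid pixel range
    return x_adv

rng = np.random.default_rng(0)
d = 784                                   # e.g. a flattened 28x28 image
w = rng.standard_normal(d)                # stand-in for trained weights
x = rng.random(d)                         # stand-in for a test image in [0, 1]
b = 1.0 - w @ x                           # chosen so x is classified as class 1
x_adv = find_adversarial(x, 1, w, b)
max_change = float(np.max(np.abs(x_adv - x)))
```

Even in this toy setting the largest per-pixel change needed to flip the prediction is tiny, because many small changes add up across hundreds of input dimensions.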

Findings:

There is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis.
It is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
Deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent.
We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network’s prediction error.
The specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
Deep neural networks that are learned by backpropagation have nonintuitive characteristics and intrinsic blind spots, whose structure is connected to the data distribution in a non-obvious way.

Evidence that individual units have no special semantic meaning:

Based on experiments using convolutional neural networks trained on MNIST and on ImageNet (AlexNet).
The experiments put into question the notion that neural networks disentangle variation factors across coordinates.
The results show that the natural basis is no better than a random basis for inspecting the properties of a last layer output unit.
The paper visually compared images that maximize the activations in the natural basis and images that maximize the activation in random directions. In both cases the resulting images share many high-level similarities.
The compared images appear to be semantically meaningful for both the single unit and the combination of units.
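The comparison above boils down to ranking inputs by their projection onto a direction in feature space, where the direction is either a single unit (a natural basis vector) or a random vector. A minimal sketch, with a random matrix standing in for the last-layer activations of a held-out set:

```python
import numpy as np

def top_k_along(features, direction, k=5):
    """Indices of the k inputs whose features project most strongly onto direction."""
    scores = features @ direction
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 64))  # hypothetical activations: 1000 images, 64 units

unit = np.zeros(64)
unit[3] = 1.0                               # natural basis: a single unit
random_dir = rng.standard_normal(64)
random_dir /= np.linalg.norm(random_dir)    # random direction of unit length

top_unit = top_k_along(features, unit)
top_random = top_k_along(features, random_dir)
```

With a real network one would then look at the images indexed by `top_unit` and `top_random`; the paper's observation is that both sets look about equally interpretable.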

Reasons why a deep NN is not stable to small perturbations of its input:

Unit-level inspection methods had relatively little utility beyond confirming certain intuitions regarding the complexity of the representations learned by a deep neural network.
Global, network level inspection methods can be useful in the context of explaining classification decision made by the model.
The output layer unit of a neural network is a highly nonlinear function of its input.
When the output layer unit is trained with the cross-entropy loss (using the softmax activation function), it represents a conditional distribution of the label given the input (and the training set presented so far).
It has been argued [2] that the deep stack of non-linear layers in between the input and the output unit of a neural network are a way for the model to encode a non-local generalization prior over the input space. In other words, it is possible for the output unit to assign non-significant probabilities to regions of the input space that contain no training examples in their vicinity.
Such regions can represent, for instance, the same objects from different viewpoints, which are relatively far (in pixel space), but which share nonetheless both the label and the statistical structure of the original inputs.
It is implicit in such arguments that local generalization—in the very proximity of the training examples—works as expected.
This kind of smoothness prior is typically valid for computer vision problems, where imperceptibly tiny perturbations of a given image do not normally change the underlying class.
Based on some experiments in the paper, however, the smoothness assumption that underlies many kernel methods does not hold.
Using a simple optimization procedure, the authors are able to find adversarial examples, which are obtained by imperceptibly small perturbations to a correctly classified input image, so that it is no longer classified correctly. This can never occur with smooth classifiers by their definition.
The authors found a way to traverse the manifold represented by the network efficiently (by optimization) and to find adversarial examples in the input space.
The adversarial examples represent low-probability (high-dimensional) “pockets” in the manifold, which are hard to efficiently find by simply randomly sampling the input around a given example.
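One way to quantify the instability discussed above, in the spirit of the paper's spectral analysis, is to bound how much the output can change per unit change of the input. For fully connected layers with ReLU (which never increases distances), the product of the largest singular values of the weight matrices gives such an upper bound. The sketch below uses random toy weights and covers dense layers only; convolutional layers need a different computation that is not shown.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of the spectral norms (largest singular values) of the layers."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=2)
    return float(bound)

rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((32, 64)),   # hypothetical layer 1
           0.1 * rng.standard_normal((10, 32))]   # hypothetical layer 2

def forward(x):
    hidden = np.maximum(weights[0] @ x, 0.0)      # ReLU layer
    return weights[1] @ hidden

bound = lipschitz_upper_bound(weights)
x = rng.standard_normal(64)
r = 1e-3 * rng.standard_normal(64)                # a small perturbation
ratio = float(np.linalg.norm(forward(x + r) - forward(x)) / np.linalg.norm(r))
```

A large bound does not prove that adversarial examples exist, but a small one would rule them out, so it is a conservative measure of stability.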

Conclusions:

Deep neural networks have counter-intuitive properties both with respect to the semantic meaning of individual units and with respect to their discontinuities.
The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance.
Indeed, if the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples?
The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.

My notes and review:

The formal description of how to generate adversarial examples is given.
Spectral analysis on stability of deep NN is also given.
This paper is very enlightening for two reasons: (1) Two images that we see as similar can actually be interpreted as totally different images (objects), and vice versa, two images that we see as different can actually be interpreted as the same; (2) The deep NN still does not see as a human sees. It seems human vision is still more robust and error tolerant. What actually makes us better than deep NN in this respect?
Even though it is stated that such adversarial images in reality are rarely observed, it is challenging to propose algorithms that can effectively handle the adversarial examples.

March 25, 2015March 26, 2015 fananymi

Review on Famous Google’s Deep Learning Paper

Posted by Mohamad Ivan Fanany


This writing summarizes and reviews the famous Google’s deep learning paper: Building high-level features using large scale unsupervised learning.

Addressed problem: Building high-level, class-specific feature detectors from only unlabeled data

Network structure (architecture): A 9-layered locally connected sparse autoencoder with local receptive fields, pooling, and local contrast normalization, trained on a large dataset of images (the model has 1 billion connections)

Dataset: 200×200 pixel images sampled from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. The image size is much larger than the typical 32×32 images often used in other deep learning and unsupervised learning work.

Training method: Model parallelism (a deep autoencoder with pooling and local contrast normalization) and asynchronous SGD (Stochastic Gradient Descent).

Hardware: A large computer cluster with 1,000 machines (16,000 cores)

Training time: Three days

Findings:

It is possible to train a face detector without having to label images as containing a face or not
This feature detector is robust not only to translation but also to scaling and out-of-plane rotation
The same network is sensitive to other high-level concepts such as cat faces and human bodies

Results:

15.8% accuracy in recognizing 20,000 object categories from ImageNet
A leap of 70% relative improvement over the previous state-of-the-art
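The 70% figure is a relative, not absolute, improvement. The arithmetic below assumes the previous state-of-the-art accuracy was 9.3%, which is the number I recall from the paper's results table; it is not stated in this review, so treat it as an assumption.

```python
previous_best = 9.3    # assumed previous state-of-the-art accuracy, in percent
new_accuracy = 15.8    # accuracy reported above, in percent

relative_improvement = (new_accuracy - previous_best) / previous_best
print(f"{relative_improvement:.1%}")
```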

Inspiration from previous works:

The neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as “grand-mother neurons.”
The style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is inspired by Neocognitron and HMAX. Such a style is argued to be an architecture employed by the brain.
The learning of parameters in the second layer, which uses sparsity and reconstruction terms, is also known as reconstruction Topographic Independent Component Analysis (Hyvarinen et al., 2009; Le et al., 2011a). The first term ensures the representations encode important information about the data, i.e., they can reconstruct the input data. The second term encourages pooling features to group similar features together to achieve invariances.
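The two terms described above can be written down as a small function. This is my reading of the objective, a reconstruction error plus a weighted sum of pooled (square-root of summed squares) responses; the matrix shapes, the names `W1`, `W2`, `H`, and the default `lam` and `eps` values are assumptions for illustration only.

```python
import numpy as np

def reconstruction_tica_objective(X, W1, W2, H, lam=0.1, eps=1e-6):
    """Reconstruction term + lam * pooling (sparsity) term.

    X:  (n, m) data, one example per column
    W1: (n, d) encoding weights, W2: (n, d) decoding weights
    H:  (k, d) fixed non-negative pooling weights (which features are grouped)
    """
    A = W1.T @ X                                  # feature responses, (d, m)
    reconstruction = np.sum((W2 @ A - X) ** 2)    # first term: rebuild the input
    pooled = np.sqrt(eps + H @ (A ** 2))          # L2 pooling over feature groups
    return float(reconstruction + lam * np.sum(pooled))

# Tiny check: identity weights reconstruct perfectly, one pooling unit over both features
X = np.array([[3.0], [4.0]])
I = np.eye(2)
H = np.array([[1.0, 1.0]])
value = reconstruction_tica_objective(X, I, I, H, lam=1.0, eps=0.0)
```

In the check, the reconstruction term is zero and the pooled response is sqrt(3² + 4²) = 5, so the objective equals 5.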

Current neuroscientific status:

The extent of class-specificity of neurons in the brain is an area of active investigation
Current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).

Motivation:

Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain class-specific feature detectors.
The need for large labeled sets poses a significant challenge for problems where labeled data are rare
Approaches that make use of inexpensive unlabeled data are often preferred, however, they have not been shown to work well for building high-level features.

Key ideas:

Investigates the feasibility of building high-level features from only unlabeled data.
Inexpensive way to develop features from unlabeled data.
Answers an intriguing question as to whether the specificity of the “grandmother neuron” could possibly be learned from unlabeled data.
The parallelism (parameters are distributed across the machines) on the computer cluster uses the idea of local receptive fields (each feature in the autoencoder can connect only to a small region of the lower layer) to reduce communication costs between machines.
Invariance to local deformations is achieved by employing local L2 pooling (Hyvarinen et al., 2009; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009).
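A small sketch of the two operations named above. The L2 pooling follows the usual definition (square root of the sum of squares in a window). The contrast normalization is a simplified version with uniform windows, which is my assumption; the cited work uses weighted windows.

```python
import numpy as np

def l2_pool(fmap, size=2, stride=1):
    """Square root of the sum of squared responses in each local window."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = np.sqrt(np.sum(win ** 2))
    return out

def local_contrast_normalize(fmap, size=3, eps=1e-2):
    """Subtract the local mean and divide by the local standard deviation."""
    pad = size // 2
    padded = np.pad(fmap, pad, mode="reflect")
    out = np.empty(fmap.shape, dtype=float)
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            win = padded[i:i + size, j:j + size]
            out[i, j] = (fmap[i, j] - win.mean()) / max(win.std(), eps)
    return out

# The same responses at shifted positions inside a window pool to the same value
a = np.array([[3.0, 0.0], [0.0, 4.0]])
b = np.array([[0.0, 3.0], [4.0, 0.0]])
pooled_a, pooled_b = l2_pool(a), l2_pool(b)
flat = local_contrast_normalize(np.ones((4, 4)))   # constant input has no contrast
```

The pooled value does not change when the pattern moves inside the window, which is the sense in which pooling gives invariance to local deformations.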

Previous works:

(Raina et al., 2007) Self-taught learning framework: Using unlabeled data in the wild to learn features.
Previous successful features learning algorithms:
- RBMs (Hinton et al., 2006)
- Autoencoders (Hinton & Salakhutdinov, 2006, Bengio et al., 2007)
- Sparse coding (Lee et al., 2007)
- K-means (Coates et al., 2011)
Deep learning needs a lot of training time to yield good results (Ciresan et al., 2010).

Key difference to previous works:

Previous works have only succeeded in learning low-level features such as “edge” or “blob”. The paper goes beyond such simple features and captures complex invariances.
In previous works, reducing the time to train the networks (for practical reasons) undermines the learning of high-level features. The paper proposes a way to scale up the dataset, the model, and the computational resources.
Although it also uses local receptive fields, unlike previous works the receptive fields are not convolutional: the parameters are not shared across different locations in the image.
In terms of scale, the network with 1 billion trainable parameters is perhaps the largest known network to date. Previous works only went up to 10 million parameters. The human visual cortex is actually 1 million times larger.
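The difference between convolutional and merely local receptive fields, noted above, is easiest to see in one dimension. In the sketch below (toy sizes of my choosing), the convolution reuses one filter at every position, while the locally connected layer has a separate filter per position and therefore more parameters.

```python
import numpy as np

def conv1d_shared(x, kernel):
    """Convolutional: the same kernel is applied at every location."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def locally_connected_1d(x, kernels):
    """Locally connected: each output location has its own kernel."""
    n_out, k = kernels.shape
    return np.array([x[i:i + k] @ kernels[i] for i in range(n_out)])

x = np.arange(6, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])

shared = conv1d_shared(x, kernel)          # 4 outputs from 3 parameters
kernels = np.tile(kernel, (4, 1))          # start from the same filter everywhere...
kernels[2] = [0.0, 1.0, 0.0]               # ...but location 2 is free to differ
local = locally_connected_1d(x, kernels)   # 4 outputs from 12 parameters
```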

Discovery:

It is possible to build high-level features from unlabeled data for classification and visualization.
A feature that is highly selective for faces. This is validated by visualization via numerical optimization.
The learned detector is invariant to translation, and also to out-of-plane rotation and scaling.
Network also learns the concepts of cat faces and human bodies.

My review:

This is an interesting and inspiring paper which pushes the advances of deep learning and unsupervised feature learning to address the very problem that intrigues many scientists: how humans “see”.
It seems difficult to replicate the experiments due to the high-end resource requirements. Can it be proved to also work in a simpler and smaller network? How big or small is enough for deep learning?
I wonder how the authors of the paper will address the more recent paper on Intriguing properties of neural networks discussed here.
