Posted by Mohamad Ivan Fanany

This writing summarizes and reviews the first paper on rectified linear units (the building block for current state-of-the-art implementations of deep convolutional neural networks): Rectified Linear Units Improve Restricted Boltzmann Machines.

**Motivations**:

- Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data including labeled or unlabeled images (Hinton et al., 2006), sequences of mel-cepstral coefficients that represent speech (Mohamed & Hinton, 2010), bags of words that represent documents (Salakhutdinov & Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007).
- In their conditional form they can be used to model high-dimensional temporal sequences such as video or motion capture data (Taylor et al., 2006).
- Their most important use is as learning modules that are composed to form deep belief nets (Hinton et al., 2006).

**Key ideas**:

- Restricted Boltzmann machines were developed using binary stochastic hidden units.
- These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases.
- The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units.
- Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
- Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.

**Learning an RBM**:

- Images composed of binary pixels can be modeled by an RBM that uses a layer of binary hidden units (feature detectors) to model the higher-order correlations between pixels.
- If there are no direct interactions between the hidden units and no direct interactions between the visible units that represent the pixels, there is a simple and efficient method called “Contrastive Divergence” to learn a good set of feature detectors from a set of training images (Hinton, 2002).
- Start with small, random weights on the symmetric connections between each pixel and each feature detector.
- Then repeatedly update each weight using the difference between two measured pairwise correlations between visible units and hidden units.
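A minimal NumPy sketch of the CD-1 update described above; the array shapes, learning rate, and function names are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
    """One Contrastive Divergence (CD-1) update for a binary-binary RBM.
    v0: batch of binary visible vectors, shape (batch, n_vis)."""
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # One step of Gibbs sampling: reconstruct visibles, re-infer hiddens.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Update: difference of the two measured pairwise correlations,
    # <v h>_data minus <v h>_reconstruction.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid
```

Starting from small random weights, this update is applied repeatedly over mini-batches of training images.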

**Gaussian Units**:

- RBMs were originally developed using binary stochastic units for both the visible and hidden layers (Hinton, 2002).
- To deal with real-valued data such as the pixel intensities in natural images, (Hinton & Salakhutdinov, 2006) replaced the binary visible units by linear units with independent Gaussian noise as first suggested by (Freund & Haussler, 1994).
- It is possible to learn the variance of the noise for each visible unit but this is difficult using binary hidden units.
- In many applications, it is much easier to first normalise each component of the data to have zero mean and unit variance and then to use noise-free reconstructions, with the variance set to 1.
- The reconstructed value of a Gaussian visible unit is then equal to its top-down input from the binary hidden units plus its bias.
- We use this type of noise-free visible unit for the models of object and face images described later.
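A sketch of the normalisation step and the noise-free Gaussian visible reconstruction described above (function names and shapes are hypothetical):

```python
import numpy as np

# Normalise each component of the data to zero mean and unit variance
# (done once, before training), so the noise variance can be fixed at 1.
def normalize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Noise-free reconstruction of Gaussian visible units: each visible unit
# equals its top-down input from the binary hidden units plus its bias.
def reconstruct_visibles(h, W, b_vis):
    return h @ W.T + b_vis  # linear: no sigmoid, no added noise
```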

**Rectified Linear Units**:

- To allow each unit to express more information, (Teh & Hinton, 2001) introduced binomial units which can be viewed as separate copies of a binary unit that all share the same bias and weights.
- A nice side-effect of using weight-sharing to synthesize a new type of unit out of binary units is that the mathematics underlying learning in binary-binary RBMs remains unchanged.
- Since all copies receive the same total input, they all have the same probability of turning on, and this only has to be computed once.
- For small probabilities this acts like a Poisson unit, but as the probability approaches 1 the variance becomes small again, which may not be desirable. Also, for small probabilities the growth in probability is exponential in the total input.
- A small modification to binomial units makes them far more interesting as models of real neurons and also more useful for practical applications.
- We make an infinite number of copies that all have the same learned weight vector and the same learned bias, but each copy has a different, fixed offset to the bias. If the offsets are −0.5, −1.5, −2.5, …, the sum of the probabilities of the copies is extremely close to log(1 + e^x), where x is the total input.
- The total activity of all of the copies behaves like a noisy, integer-valued version of a smoothed rectified linear unit.
- A drawback of giving each copy a bias that differs by a fixed offset is that the logistic sigmoid function needs to be used many times to get the probabilities required for sampling an integer value correctly.
- It is possible, however, to use a fast approximation in which the sampled value of the rectified linear unit is not constrained to be an integer.
- We call a unit that uses this approximation a Noisy Rectified Linear Unit (NReLU).
- This paper shows that NReLUs work better than binary hidden units for several different tasks.
- (Jarrett et al., 2009) have explored various rectified nonlinearities in the context of convolutional networks and have found them to improve discriminative performance.
- The empirical results in this paper further support this observation.
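To make the stepped-sigmoid construction and its NReLU approximation concrete, here is a small NumPy sketch. It assumes the sampled value is x plus Gaussian noise whose variance is sigmoid(x), as in the paper; all function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid(x, n_copies=100):
    # Sum of the "on" probabilities of binary copies whose bias
    # offsets are -0.5, -1.5, -2.5, ...
    offsets = np.arange(n_copies) + 0.5
    return sigmoid(x - offsets[:, None]).sum(axis=0)

def softplus(x):
    # The closed form the sum is extremely close to: log(1 + e^x).
    return np.log1p(np.exp(x))

def nrelu_sample(x):
    # Fast approximation (NReLU): a rectified linear unit with added
    # Gaussian noise whose variance is sigmoid(x); the sampled value
    # is no longer constrained to be an integer.
    return np.maximum(0.0, x + rng.normal(0.0, np.sqrt(sigmoid(x))))
```

Evaluating `stepped_sigmoid` and `softplus` on the same inputs shows the two curves are nearly indistinguishable, which is why the smooth rectified form can replace the many sigmoid evaluations.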

**Intensity Equivariance**:

- NReLUs have some interesting mathematical properties (Hahnloser et al., 2003), one of which is very useful for object recognition.
- A major consideration when designing an object recognition system is how to make the output invariant to properties of the input such as location, scale, orientation, lighting etc.
- Convolutional neural networks are often said to achieve translation invariance but in their pure form they actually achieve something quite different.
- If an object is translated in the input image, its representation in a pool of local filters that have shared weights is also translated. So if it can be represented well by a pattern of feature activities when it is in one location, it can also be represented equally well by a translated pattern of feature activities when it is in another location.
- We call this translation equivariance: the representation varies in the same way as the image.
- In a deep convolutional net, translation invariance is achieved by using subsampling to introduce a small amount of translation invariance after each layer of filters.
- Binary hidden units do not exhibit intensity equivariance, but rectified linear units do, provided they have zero biases and are noise-free.
- Scaling up all of the intensities in an image cannot change whether a zero-bias unit receives a total input above or below zero.
- So all of the “off” units remain off and the remainder all increase their activities by a factor.
- This stays true for many layers of rectified linear units. When deciding whether two face images come from the same person, we make use of this nice property of rectified linear units by basing the decision on the cosine of the angle between the activities of the feature detectors in the last hidden layer.
- The feature vectors are intensity equivariant and the cosine is intensity invariant.
- The type of intensity invariance that is important for recognition cannot be achieved by simply dividing all the pixel intensities by their sum. This would cause a big change in the activities of feature detectors that attend to the parts of a face when there is a bright spot in the background.
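The equivariance argument above can be checked numerically. A small sketch with random weights (all names and shapes are hypothetical): scaling the input by a positive factor cannot flip the sign of any unit's total input, so the "off" units stay off, every active unit scales by the same factor, and the cosine between feature vectors is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two layers of noise-free, zero-bias rectified linear units.
W1 = rng.standard_normal((64, 32))
W2 = rng.standard_normal((32, 16))

def features(image):
    return relu(relu(image @ W1) @ W2)

image = rng.standard_normal(64)
f = features(image)
f_scaled = features(3.0 * image)  # scale all pixel intensities by 3

# Equivariance: the feature vector scales by the same factor ...
print(np.allclose(f_scaled, 3.0 * f))        # prints True
# ... so the cosine between feature vectors is intensity invariant.
print(np.isclose(cosine(f, f_scaled), 1.0))  # prints True
```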

**Experiments**:

- The paper empirically compares NReLUs to stochastic binary hidden units on two vision tasks:
  - Object recognition on the Jittered-Cluttered NORB dataset (LeCun et al., 2004),
  - Face verification on the Labeled Faces in the Wild dataset (Huang et al., 2007).

- Both datasets contain complicated image variability that makes them difficult tasks. Both also already have a number of published results for various methods, which gives a convenient basis for judging how good the new results are.
- The paper uses RBMs with binary hidden units or NReLUs to generatively pre-train one or more layers of features and then discriminatively fine-tune the features using backpropagation.
- On both tasks NReLUs give better discriminative performance than binary units.

**Results on Jittered-Cluttered NORB dataset**:

- NORB is a synthetic 3D object recognition dataset that contains five classes of toys (humans, animals, cars, planes, trucks) imaged by a stereo-pair camera system from different viewpoints under different lighting conditions.
- NORB comes in several versions – the Jittered-Cluttered version has grayscale stereo-pair images with a cluttered background and a central object which is randomly jittered in position, size, pixel intensity etc. There is also a distractor object placed in the periphery.
- For each class, there are ten different instances, five of which are in the training set and the rest in the test set. So at test time a classifier needs to recognize unseen instances of the same classes. In addition to the five object classes, there is a sixth class whose images contain none of the objects in the centre.
- NReLUs outperform binary units, both when randomly initialized and when pre-trained (the error rate is 5.2% lower without pre-training and 2.2% lower with pre-training).
- Pre-training helps improve the performance of both unit types, but NReLUs without pre-training are still better than binary units with pre-training.
- The paper also reports results for classifiers with two hidden layers. Just as for single-hidden-layer classifiers, NReLUs outperform binary units regardless of whether greedy pre-training is used only in the first layer, in both layers, or not at all.
- Pre-training improves the results: pre-training only the first layer and randomly initializing the second layer is better than randomly initializing both.
- Pre-training both layers gives further improvement for NReLUs but not for binary units.
- For comparison, the error rates of some other models are: multinomial regression on pixels 49.9%, Gaussian kernel SVM 43.3%, convolutional net 7.2%, convolutional net with an SVM at the top-most hidden layer 5.9%.
- The last three results are from (Bengio & LeCun, 2007). The results of the proposed models are worse than those of convolutional nets, but the proposed models use heavily subsampled images, whereas convolutional nets have knowledge of image topology and approximate translation invariance hard-wired into their architecture.

**Results on Labeled Faces in the Wild dataset**:

- The prediction task for the Labeled Faces in the Wild (LFW) dataset is as follows: given two face images as input, predict whether the identities of the faces are the same or different.
- The dataset contains colour faces of public figures collected from the web using a frontal-face detector. The bounding box computed by the face detector is used to approximately normalize the face’s position and scale within the image.
- Models using NReLUs seem to be more accurate, but the standard deviations are too large to draw firm conclusions.

**Summary**:

- The paper showed how to create a more powerful type of hidden unit for an RBM by tying the weights and biases of an infinite set of binary units.
- The paper then approximated these stepped sigmoid units with noisy rectified linear units and showed that they work better than binary hidden units for recognizing objects and comparing faces.
- The paper also showed that they can deal with large intensity variations much more naturally than binary units.
- Finally the paper showed that they implement mixtures of undirected linear models (Marks & Movellan, 2001) with a huge number of components using a modest number of parameters.

**My reviews**:

- I read this paper while trying to find out what the Rectified Linear Unit (ReLU) is: the building block that, when applied to modern convolutional neural networks, delivers classification performance that surpasses human performance on image recognition tasks.
- I wanted to know the meaning of the word “rectified” in the term. From reading this paper and also the following links on Wikipedia, Quora, and an article by Alexandre Dalyac, it seems the term “rectify” means rectifying the positive saturated part of the well-known sigmoid function, so the range of the activation function's output is not [0, 1] but [0, infinity). Dalyac stated that the ReLU is the building block for current state-of-the-art implementations of deep convolutional neural networks.
- What impressed me when reading the paper is the comparison of the results (~15.2% error) with the Gaussian kernel SVM (43.3% error).
- The paper gives a lot of references on ReLUs.
- The paper presents experimental results that deliver better classification performance than the other methods compared (except the CNN).
- The paper also discusses why better results can be achieved: the mathematical properties of NReLUs, one of which (intensity equivariance) is very useful for object recognition.