Posted by Mohamad Ivan Fanany
This writing summarizes and reviews a paper that tries to explain why large convolutional networks demonstrate such impressive classification performance: Visualizing and Understanding Convolutional Networks, by Zeiler and Fergus.
- Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark.
- Despite this encouraging progress, there is no clear understanding of why they perform so well, or how they might be improved. There is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without a clear understanding of how and why they work, the development of better models is reduced to trial-and-error.
- Explore why Large Convolutional Networks perform so well, and how they might be improved.
- Introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The new visualization technique:
- Reveals the input stimuli that excite individual feature maps at any layer in the model.
- Allows us to observe the evolution of features during training and to diagnose potential problems with the model.
- Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
- Perform an ablation study to discover the performance contribution from different model layers.
- The proposed ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
- Perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification
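The occlusion experiment can be sketched in a few lines: slide a gray square over the image and record the classifier's score for the true class at each position; a low score means the occluded region mattered. The function and toy classifier below are my own illustration, not the paper's code:

```python
import numpy as np

def occlusion_sensitivity(image, classify, true_class, patch=4, stride=2, fill=0.5):
    """Slide a gray patch over the image and record the classifier's
    score for the true class at each position (low score means the
    occluded region was important for classification)."""
    H, W = image.shape[:2]
    heat = []
    for y in range(0, H - patch + 1, stride):
        row = []
        for x in range(0, W - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y+patch, x:x+patch] = fill   # gray out one region
            row.append(classify(occluded)[true_class])
        heat.append(row)
    return np.array(heat)

# Toy stand-in "classifier": class-0 score is just the mean brightness.
def toy_classify(img):
    p = float(img.mean())
    return np.array([p, 1.0 - p])

img = np.ones((8, 8))
heat = occlusion_sensitivity(img, toy_classify, true_class=0)
```

In the paper the heatmap is built from a real convnet's softmax output; here the toy classifier only shows the mechanics of the sliding occluder.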
- Since their introduction by LeCun et al. in the early 1990s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection.
- In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks:
- Ciresan et al. demonstrate state-of-the-art performance on the NORB and CIFAR-10 datasets.
- Most notably, Krizhevsky et al. show record-beating performance on the ImageNet 2012 classification benchmark, achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%.
- Girshick et al. have shown leading detection performance on the PASCAL VOC dataset.
- Several factors are responsible for this dramatic improvement in performance:
- The availability of much larger training sets, with millions of labeled examples;
- Powerful GPU implementations, making the training of very large models practical;
- Better model regularization strategies, such as Dropout.
- Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used.
- The proposed visualization technique uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al., to project the feature activations back to the input pixel space.
- A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse: instead of mapping pixels to features, it maps features back to pixels. In Zeiler et al., deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
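The key "reversal" step is unpooling: max pooling is non-invertible, so the deconvnet records the location of each maximum (a "switch") during the forward pass and places reconstructed values back at those locations. A toy sketch of that mechanism (my own illustration, not the paper's code):

```python
import numpy as np

def maxpool_with_switches(x, k=2):
    """2x2 max pooling that also records the location ("switch") of each max."""
    H, W = x.shape
    pooled = np.zeros((H // k, W // k))
    switches = {}
    for i in range(H // k):
        for j in range(W // k):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            switches[(i, j)] = (i*k + dy, j*k + dx)
            pooled[i, j] = block[dy, dx]
    return pooled, switches

def unpool(pooled, switches, shape):
    """Deconvnet unpooling: place each pooled value back at its switch
    location, zeros elsewhere (an approximate inverse of max pooling)."""
    out = np.zeros(shape)
    for (i, j), (y, x) in switches.items():
        out[y, x] = pooled[i, j]
    return out

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 6., 0.],
              [7., 0., 0., 8.]])
p, sw = maxpool_with_switches(x)
recon = unpool(p, sw, x.shape)   # sparse map: maxima restored in place
```

The full deconvnet additionally rectifies the signal and filters with transposed versions of the learned filters; only the switch mechanism is shown here.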
- An alternative approach finds the optimal stimulus for each unit by performing gradient ascent in image space to maximize the unit's activation.
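That gradient-based alternative can be sketched on a toy linear "unit" a(x) = w·x, a stand-in for a real convnet unit (everything here is illustrative):

```python
import numpy as np

# Toy setup: one "unit" is a linear function a(x) = w . x of the image x.
rng = np.random.default_rng(0)
w = rng.normal(size=16)           # the unit's weights (stand-in for a convnet unit)
x = np.zeros(16)                  # start from a blank image
lr = 0.1
a0 = float(w @ x)                 # initial activation (zero)

for _ in range(50):
    x += lr * w                   # gradient of w.x with respect to x is just w
    x = np.clip(x, -1.0, 1.0)     # keep the image within a valid pixel range

a1 = float(w @ x)                 # activation after ascent
```

For a real network the gradient is obtained by backpropagation through the layers; the clipping step stands in for the image-range constraints such methods need.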
- The proposed approach in the paper is similar to contemporary work by Simonyan et al., who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features used here.
- Girshick et al. show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. The proposed visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
- Start with the architecture of Krizhevsky et al. and explore different architectures, discovering ones that outperform their results on ImageNet.
- Explore the generalization ability of the model to other datasets by retraining only the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. and others.
- Standard fully supervised convnet models are used throughout the paper, as defined by LeCun et al. and Krizhevsky et al.
- Map a color 2D input image, via a series of layers, to a probability vector over different classes.
- Each layer consists of:
- Convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;
- Passing the responses through a rectified linear function;
- Optionally, max pooling over local neighborhoods; and
- Optionally, a local contrast operation that normalizes the responses across feature maps.
- For more details of these operations, see the references cited in the paper.
- The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier.
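The layer recipe above (convolve, rectify, optionally pool, finish with fully-connected layers and a softmax) can be sketched end-to-end at toy scale with random weights; this is purely illustrative and not the paper's architecture:

```python
import numpy as np

def conv2d(x, f):
    """Valid-mode 2D cross-correlation (the 'convolution' in convnets)."""
    H, W = x.shape
    k = f.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * f)
    return out

def relu(z):
    """Rectified linear function."""
    return np.maximum(z, 0.0)

def maxpool(z, k=2):
    """Non-overlapping k x k max pooling."""
    H, W = z.shape
    return z[:H//k*k, :W//k*k].reshape(H//k, k, W//k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One pass through the layer recipe (toy sizes, random weights):
rng = np.random.default_rng(0)
img = rng.normal(size=(6, 6))
feat = maxpool(relu(conv2d(img, rng.normal(size=(3, 3)))))  # conv -> relu -> pool
W_fc = rng.normal(size=(2, feat.size))                      # fully-connected layer
probs = softmax(W_fc @ feat.ravel())                        # softmax classifier
```

A real model stacks several such convolutional stages, each with many feature maps, before the fully-connected layers; the single-channel version above only shows the per-layer operations.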
Dataset: ImageNet 2012 training set (1.3 million images, spread over 1000 different classes).
- Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256×256 region, subtracting the per-pixel mean (across all images), and then taking 10 different sub-crops of size 224×224 (the four corners and the center, with and without horizontal flips).
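The cropping part of this pipeline is easy to sketch. The version below omits the resize-to-256 step and uses a crude scalar stand-in for the dataset's per-pixel mean image:

```python
import numpy as np

def ten_crops(img, size=224):
    """Four corner crops + center crop of `size`, each with its horizontal
    flip, giving 10 sub-crops in total."""
    H, W = img.shape[:2]
    ys = [0, 0, H - size, H - size, (H - size) // 2]
    xs = [0, W - size, 0, W - size, (W - size) // 2]
    crops = []
    for y, x in zip(ys, xs):
        c = img[y:y+size, x:x+size]
        crops.append(c)
        crops.append(c[:, ::-1])   # horizontal flip (reverse the width axis)
    return crops

img = np.arange(256 * 256 * 3, dtype=float).reshape(256, 256, 3)
# Stand-in for the dataset's per-pixel mean image (in the paper, a mean
# over the whole training set is subtracted).
mean_image = np.full_like(img, img.mean())
crops = ten_crops(img - mean_image)
```

At test time the network's predictions over the 10 crops are typically averaged.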
- Uses a large set of labeled images, where the label is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare the output and the target.
- The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers, and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent.
- Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 0.01, in conjunction with a momentum term of 0.9.
- The learning rate is annealed throughout training, manually decreased whenever the validation error plateaus. Dropout is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 0.01 and biases are set to 0.
- Stopped training after 70 epochs.
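The update rule described above (mini-batch SGD, learning rate 0.01, momentum 0.9, cross-entropy loss, weights initialized to 0.01) can be sketched on a toy linear softmax model; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))        # one mini-batch of 128 examples
y = rng.integers(0, 3, size=128)      # labels for 3 classes
W = np.full((10, 3), 0.01)            # weights initialized to 0.01
v = np.zeros_like(W)                  # momentum buffer
lr, momentum = 0.01, 0.9

def loss_and_grad(W):
    """Cross-entropy loss and its gradient for a linear softmax model."""
    z = X @ W
    z -= z.max(axis=1, keepdims=True)             # for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    p[np.arange(len(y)), y] -= 1.0                # d(loss)/dz for softmax + CE
    return loss, X.T @ p / len(y)

loss0, _ = loss_and_grad(W)
for _ in range(100):
    loss, g = loss_and_grad(W)
    v = momentum * v - lr * g                     # momentum update
    W = W + v
loss1, _ = loss_and_grad(W)                       # loss drops on this batch
```

The real model applies the same update to every convolutional filter and fully-connected weight matrix via backpropagation; only the optimizer mechanics are shown here.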
Training time: 12 days on a single GTX580 GPU, using the implementation cited in the paper.
- The paper explored large convolutional neural network models, trained for image classification, in a number of ways.
- The paper presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers.
- The paper also shows how these visualizations can be used to identify problems with the model and so obtain better results, for example improving on Krizhevsky et al.'s impressive ImageNet 2012 result.
- The paper demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context.
- An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.
- The ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that the model can beat the best reported results, in the latter case by a significant margin.
- The proposed convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias, although it was still within 3.2% of the best reported result, despite no tuning for the task.
My notes and review:
- This paper is interesting because it visualizes the features learned inside deep convolutional neural networks via a deconvolutional network. Such visualizations can hardly be found in other CNN papers, and they bring more understanding of why convolutional neural networks perform so well on visual recognition tasks.
- Looking at the visualizations of features in a fully trained model, which show the top activations in a random subset of feature maps across the validation data projected down to pixel space using the deconvolutional network, we can confirm that the features visually represent the input.
- I am still curious whether our brain actually performs convolution and deconvolution. The answer might be found by looking back at the work of Kunihiko Fukushima on the neocognitron, which was inspired by the model proposed by Hubel and Wiesel in 1959.