Summary of the “ImageNet Classification with Deep Convolutional Neural Networks” paper

Introduction
This paper demonstrates that a deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning; the model entered the ImageNet LSVRC-2012 competition under the name SuperVision. The model is a deep convolutional neural network trained on raw RGB pixel values. The network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for about a week. To make training faster, the authors used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the fully-connected layers, they employed hidden-unit “dropout”.
The dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The ILSVRC competitions use a subset of ImageNet with roughly 1,000 images in each of 1,000 categories: in all, roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
The architecture
The network contains eight learned layers: five convolutional and three fully-connected. To make the layer stack concrete, here is a minimal single-GPU sketch in PyTorch (not the authors’ original two-GPU CUDA code); it assumes 227 × 227 inputs so the convolution arithmetic works out, and omits the response-normalization layers discussed below.
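```python
import torch
import torch.nn as nn

# Five convolutional layers, some followed by overlapping max-pooling.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # conv1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
)
# Three fully-connected layers ending in a 1000-way classifier.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # fc6
    nn.Linear(4096, 4096), nn.ReLU(),         # fc7
    nn.Linear(4096, 1000),                    # fc8 (softmax applied in the loss)
)

x = torch.randn(1, 3, 227, 227)
print(classifier(features(x)).shape)  # torch.Size([1, 1000])
```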

ReLU Nonlinearity choice
In terms of training time with gradient descent, saturating nonlinearities such as tanh(x) or the sigmoid are much slower than the non-saturating nonlinearity f(x) = max(0, x). That is why the authors chose the ReLU activation function, which speeds up training considerably: their network reached a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons. As a minimal illustration (in PyTorch, which the paper of course did not use), compare the gradients of the two activations: tanh squashes large inputs toward ±1 and its gradient vanishes there, while ReLU passes every positive input through with gradient exactly 1.
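```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)

# Saturating nonlinearity: the gradient vanishes for large |x|.
torch.tanh(x).sum().backward()
print(x.grad)  # near 0 at both ends of the range

x.grad = None
# Non-saturating ReLU: the gradient is exactly 1 for every positive input.
torch.relu(x).sum().backward()
print(x.grad)  # 0 for x < 0, 1 for x > 0
```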
Training on multiple GPUs
It turns out that 1.2 million training examples are enough to train networks that are too big to fit on one GPU (in 2012, of course), so the authors spread the network across two GPUs with 3 GB of RAM each, putting half of the kernels on each GPU. The GPUs communicate only in certain layers: for example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. A hypothetical way to mimic this restricted connectivity on a single modern GPU is grouped convolution; the sketch below uses PyTorch’s groups argument with the paper’s layer sizes, as an illustration rather than the authors’ implementation.
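```python
import torch
import torch.nn as nn

# Layer 2 analogue: 256 kernels, each seeing only the 48 kernel maps
# that lived on its own GPU -> groups=2.
conv2_restricted = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# Layer 3 analogue: 384 kernels connected to ALL 256 maps of layer 2,
# i.e. a layer where the two GPUs communicate -> groups=1 (the default).
conv3_communicating = nn.Conv2d(256, 384, kernel_size=3, padding=1)

x = torch.randn(1, 96, 27, 27)
print(conv3_communicating(conv2_restricted(x)).shape)
# torch.Size([1, 384, 27, 27])
```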
Local response normalization
ReLUs do not require input normalization to prevent them from saturating, but the authors found that the following local normalization scheme still aids generalization:

b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big( a^j_{x,y} \big)^2 \Big)^{\beta}

where a^i_{x,y} is the activity of a neuron computed by applying kernel i at position (x, y), the sum runs over n adjacent kernel maps at the same spatial position, N is the total number of kernels in the layer, and the hyper-parameters k = 2, n = 5, \alpha = 10^{-4}, and \beta = 0.75 were chosen using the validation set.
Response normalization reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
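For reference, here is a sketch using PyTorch’s built-in LocalResponseNorm with the paper’s hyper-parameters (the authors used their own CUDA implementation):

```python
import torch
import torch.nn as nn

# n = 5 adjacent kernel maps, k = 2, alpha = 1e-4, beta = 0.75.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

a = torch.randn(1, 96, 55, 55)  # e.g. activations after the first conv layer
b = lrn(a)
print(b.shape)  # same shape: each activity is scaled by its neighbors
```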
Overlapping Pooling
They used overlapping max-pooling to reduce the spatial dimensions of each feature map: the pooling windows are z × z = 3 × 3 but are spaced s = 2 pixels apart, so neighboring windows overlap. Compared with the traditional non-overlapping scheme (s = z = 2), this reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and made the network slightly more resistant to overfitting.
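In PyTorch terms (an illustration, not the original code), overlapping pooling is simply a window larger than its stride:

```python
import torch
import torch.nn as nn

# z = 3, s = 2: neighboring 3x3 windows share one row/column of inputs.
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)
# s = z = 2: the traditional non-overlapping scheme, for comparison.
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 96, 55, 55)
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
```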
Techniques applied to reduce overfitting
If you don’t have much background on this, I invite you to read a blog I wrote about regularization techniques to avoid overfitting. On this dataset they applied two such techniques: data augmentation (random 224 × 224 crops and horizontal reflections of the training images, plus PCA-based alterations of the RGB channel intensities) and dropout (zeroing the output of each hidden neuron in the first two fully-connected layers with probability 0.5), as sketched below.
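Here is a sketch of the dropout placement in the classifier head, assuming PyTorch (whose nn.Dropout rescales activations at training time, which is equivalent to the paper’s halving of outputs at test time):

```python
import torch.nn as nn

# Dropout with p = 0.5 before each of the first two fully-connected
# layers; the dropped neurons change on every forward pass, so the
# network cannot rely on any single co-adapted feature.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)
```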
Details of learning
The authors trained the model using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005; they found that this small amount of weight decay was important for the model to learn. The update rule for a weight w was:

v_{i+1} := 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} := w_i + v_{i+1}

where i is the iteration index, v is the momentum variable, \epsilon is the learning rate, and the last term is the average gradient over the i-th batch D_i.
The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two GPUs, as mentioned before.
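The same training configuration can be sketched with a modern optimizer; torch.optim.SGD with these arguments implements an update equivalent to the formula above (up to minor bookkeeping differences), though this is of course not the authors’ original code:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1000)  # stand-in model, just to build the optimizer

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # the paper's initial learning rate
    momentum=0.9,
    weight_decay=5e-4,  # the small but important weight decay
)
```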
Results
The results were pretty amazing: the network achieved top-1 and top-5 test error rates of 37.5% and 17.0% on ILSVRC-2010, and a winning top-5 test error rate of 15.3% in the ILSVRC-2012 competition, compared to 26.2% for the second-best entry.

Conclusion
It should be noted that this was one of the first works to train such a large, deep convolutional neural network, and to do so across two GPUs, which is really impressive. Throughout the paper the authors show us the procedures they applied to the dataset and the remarkable results they obtained, demonstrating how well purely supervised deep learning performs at this scale.
Personal notes
I reckon they did a fantastic job on the ImageNet dataset for image classification.