Summary of the “ImageNet Classification with Deep Convolutional Neural Networks” paper

Introduction
This paper demonstrates that a deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning; the model entered the ImageNet LSVRC-2012 competition under the name SuperVision. The model is a deep convolutional neural network trained on raw RGB pixel values. The network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for about a week. To make training faster, the authors used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the fully-connected layers, they employed hidden-unit “dropout”.
The dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The ILSVRC competitions use a subset of ImageNet with roughly 1,000 images in each of 1,000 categories: in all, roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
The architecture
The network contains eight learned layers: five convolutional and three fully-connected. To make the layer stack concrete, here is a minimal single-GPU sketch in PyTorch (not the authors’ original two-GPU CUDA code); it assumes 227 × 227 inputs so the convolution arithmetic works out, and omits the response-normalization layers discussed below.
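```python
import torch
import torch.nn as nn

# Five convolutional layers, some followed by overlapping max-pooling.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # conv1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
)
# Three fully-connected layers ending in a 1000-way classifier.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # fc6
    nn.Linear(4096, 4096), nn.ReLU(),         # fc7
    nn.Linear(4096, 1000),                    # fc8 (softmax applied in the loss)
)

x = torch.randn(1, 3, 227, 227)
print(classifier(features(x)).shape)  # torch.Size([1, 1000])
```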

ReLU Nonlinearity choice
In terms of training time with gradient descent, saturating nonlinearities such as tanh(x) or the sigmoid are much slower than the non-saturating nonlinearity f(x) = max(0, x). That is why the authors chose the ReLU activation function, which speeds up training considerably: their network reached a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons. As a minimal illustration (in PyTorch, which the paper of course did not use), compare the gradients of the two activations: tanh squashes large inputs toward ±1 and its gradient vanishes there, while ReLU passes every positive input through with gradient exactly 1.
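```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)

# Saturating nonlinearity: the gradient vanishes for large |x|.
torch.tanh(x).sum().backward()
print(x.grad)  # near 0 at both ends of the range

x.grad = None
# Non-saturating ReLU: the gradient is exactly 1 for every positive input.
torch.relu(x).sum().backward()
print(x.grad)  # 0 for x < 0, 1 for x > 0
```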
Training on multiple GPUs
It turns out that 1.2 million training examples are enough to train networks that are too big to fit on one GPU (in 2012, of course), so the authors spread the network across two GPUs with 3 GB of RAM each, putting half of the kernels on each GPU. The GPUs communicate only in certain layers: for example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. A hypothetical way to mimic this restricted connectivity on a single modern GPU is grouped convolution; the sketch below uses PyTorch’s groups argument with the paper’s layer sizes, as an illustration rather than the authors’ implementation.
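```python
import torch
import torch.nn as nn

# Layer 2 analogue: 256 kernels, each seeing only the 48 kernel maps
# that lived on its own GPU -> groups=2.
conv2_restricted = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# Layer 3 analogue: 384 kernels connected to ALL 256 maps of layer 2,
# i.e. a layer where the two GPUs communicate -> groups=1 (the default).
conv3_communicating = nn.Conv2d(256, 384, kernel_size=3, padding=1)

x = torch.randn(1, 96, 27, 27)
print(conv3_communicating(conv2_restricted(x)).shape)
# torch.Size([1, 384, 27, 27])
```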
Local response normalization
ReLUs do not require input normalization to prevent them from saturating, but the authors found that the following local normalization scheme still aids generalization:

b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big( a^j_{x,y} \big)^2 \Big)^{\beta}

where a^i_{x,y} is the activity of a neuron computed by applying kernel i at position (x, y), the sum runs over n adjacent kernel maps at the same spatial position, N is the total number of kernels in the layer, and the hyper-parameters k = 2, n = 5, \alpha = 10^{-4}, and \beta = 0.75 were chosen using the validation set.
Response normalization reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
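For reference, here is a sketch using PyTorch’s built-in LocalResponseNorm with the paper’s hyper-parameters (the authors used their own CUDA implementation):

```python
import torch
import torch.nn as nn

# n = 5 adjacent kernel maps, k = 2, alpha = 1e-4, beta = 0.75.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

a = torch.randn(1, 96, 55, 55)  # e.g. activations after the first conv layer
b = lrn(a)
print(b.shape)  # same shape: each activity is scaled by its neighbors
```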
Overlapping Pooling
They used overlapping max-pooling to reduce the spatial dimensions of each feature map: the pooling windows are z × z = 3 × 3 but are spaced s = 2 pixels apart, so neighboring windows overlap. Compared with the traditional non-overlapping scheme (s = z = 2), this reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and made the network slightly more resistant to overfitting.
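In PyTorch terms (an illustration, not the original code), overlapping pooling is simply a window larger than its stride:

```python
import torch
import torch.nn as nn

# z = 3, s = 2: neighboring 3x3 windows share one row/column of inputs.
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)
# s = z = 2: the traditional non-overlapping scheme, for comparison.
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 96, 55, 55)
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
```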
Techniques applied to reduce overfitting
If you don’t have much background on this, I invite you to read a blog I wrote about regularization techniques to avoid overfitting. On this dataset they applied two such techniques: data augmentation (random 224 × 224 crops and horizontal reflections of the training images, plus PCA-based alterations of the RGB channel intensities) and dropout (zeroing the output of each hidden neuron in the first two fully-connected layers with probability 0.5), as sketched below.
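Here is a sketch of the dropout placement in the classifier head, assuming PyTorch (whose nn.Dropout rescales activations at training time, which is equivalent to the paper’s halving of outputs at test time):

```python
import torch.nn as nn

# Dropout with p = 0.5 before each of the first two fully-connected
# layers; the dropped neurons change on every forward pass, so the
# network cannot rely on any single co-adapted feature.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)
```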
Details of learning
The authors trained the model using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005; they found that this small amount of weight decay was important for the model to learn. The update rule for a weight w was:

v_{i+1} := 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} := w_i + v_{i+1}

where i is the iteration index, v is the momentum variable, \epsilon is the learning rate, and the last term is the average gradient over the i-th batch D_i.
The network was trained for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two GPUs, as mentioned before.
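The same training configuration can be sketched with a modern optimizer; torch.optim.SGD with these arguments implements an update equivalent to the formula above (up to minor bookkeeping differences), though this is of course not the authors’ original code:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1000)  # stand-in model, just to build the optimizer

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # the paper's initial learning rate
    momentum=0.9,
    weight_decay=5e-4,  # the small but important weight decay
)
```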
Results
The results were pretty amazing: the network achieved top-1 and top-5 test error rates of 37.5% and 17.0% on ILSVRC-2010, and a winning top-5 test error rate of 15.3% in the ILSVRC-2012 competition, compared to 26.2% for the second-best entry.

Conclusion
It should be noted that this was one of the first works to train such a large, deep convolutional neural network, and to do so across two GPUs, which is really impressive. Throughout the paper the authors show us the procedures they applied to the dataset and the remarkable results they obtained, demonstrating how well purely supervised deep learning performs at this scale.
Personal notes
I reckon they did a fantastic job on the ImageNet dataset for image classification.