Hands-On AI Part 16: Modern Deep Neural Network Architectures for Image Classification

Published: 10/24/2017  

Last Updated: 10/24/2017

A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

In the previous article, Deep Learning for Image Classification (Overview of Convolutional Neural Networks), we reviewed the main concepts of convolutional neural networks (CNNs), as well as the intuition behind them. In this article, we will consider several powerful deep neural network architectures, such as AlexNet*, ZFNet*, VGG*, GoogLeNet*, and ResNet*, and summarize the key contributions introduced with each architecture. The general storyline of the article is based on the blog post, Understanding CNNs Part 3.

Convolutional Neural Network Architectures

Nowadays, the key driver behind the progress in computer vision and image classification is the ImageNet* Challenge. It is a data challenge in which participants are given a large image dataset (more than one million images), and the goal is to develop an algorithm that can classify held-out images into 1,000 object categories, such as dogs, cats, and cars, with minimal error.

According to the official rules, algorithms need to produce a list of at most five object categories in the descending order of confidence for each image. The quality of labeling will be evaluated based on the label that best matches the ground truth label for the image. The idea is to allow an algorithm to identify multiple objects in an image, and not be penalized if one of the objects identified was in fact present, but not included in the ground truth.
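In rough form, the top-five rule can be computed like this (a minimal NumPy illustration with toy scores, not the official scoring code):

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of images whose ground-truth label is absent from the
    five highest-scoring predicted categories."""
    # Sort class scores in descending order, keep the first five indices
    top5 = np.argsort(-scores, axis=1)[:, :5]
    hits = np.any(top5 == np.asarray(labels)[:, None], axis=1)
    return 1.0 - hits.mean()

# Two toy images with ten classes each
scores = np.array([
    [0.1, 0.9, 0.2, 0.3, 0.4, 0.5, 0.05, 0.02, 0.01, 0.03],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.05, 0.02, 0.01, 0.03],
])
labels = [1, 6]   # label 1 is in image 1's top five; label 6 is not in image 2's
print(top5_error(scores, labels))   # 0.5
```

An algorithm is thus never penalized for listing extra plausible categories, as long as the true one appears among its five guesses.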

During the first year of the challenge, the participants were provided with pre-extracted image features for model training; for example, vector quantized SIFT* features suitable for a bag of words or spatial pyramid representation. However, the real disruption happened in 2012, when a team from the University of Toronto demonstrated that a deep neural network can achieve dramatically better results compared to traditional machine learning models trained on vectors made of pre-extracted features. We cover the first breakthrough architecture from 2012 and its successors up to 2015 in the sections below.

Figure 1. ImageNet* top-five classification error (%) evolution. Image is taken from the Kaiming He presentation, Deep Residual Learning for Image Recognition.


AlexNet*

The AlexNet architecture was proposed in 2012 by a group of scientists (A. Krizhevsky, I. Sutskever, and G. Hinton) from the University of Toronto. It was groundbreaking work in which the authors first used a convolutional neural network that was deep for its time, with a total depth of eight layers (five convolutional and three dense).

Figure 2. AlexNet* architecture.

The network architecture consists of the following layers:

  • [Convolution + Max pooling + Normalization] x 2
  • [Convolution] x 3
  • [Max pooling]
  • [Dense] x 3
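As a back-of-the-envelope check, the spatial sizes through the convolutional part can be traced with a one-line formula. The layer settings below follow the commonly cited AlexNet configuration with a 227 x 227 input; treat the exact numbers as illustrative:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

s = conv_out(227, 11, 4)        # conv1: 96 filters, 11 x 11, stride 4 -> 55
s = conv_out(s, 3, 2)           # max pool 3 x 3, stride 2             -> 27
s = conv_out(s, 5, 1, pad=2)    # conv2: 256 filters, 5 x 5, pad 2     -> 27
s = conv_out(s, 3, 2)           # max pool                             -> 13
for _ in range(3):              # conv3-5: 3 x 3, stride 1, pad 1      -> 13
    s = conv_out(s, 3, 1, pad=1)
s = conv_out(s, 3, 2)           # final max pool                       -> 6
print(s, 256 * s * s)           # 6 x 6 x 256 = 9216 inputs to the dense layers
```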

The architecture might look a bit strange because training was split across two graphics processing units (GPUs) due to its high computational intensity. Such a split requires manually dividing the model into two towers that communicate with each other.

AlexNet reduced the top-five error rate to 16.4 percent, nearly halving the previous state of the art! The authors also introduced the rectified linear unit (ReLU) activation function, which is now standard in the field. Some other key properties of AlexNet and its learning procedure are summarized below:

  • Heavy data augmentation
  • Dropout
  • Optimization with SGD momentum (see tutorial on optimization, An overview of gradient descent optimization algorithms)
  • Manually scheduled learning rate (reduced by a factor of 10 when the accuracy plateaus)
  • Final model is an ensemble of seven CNNs
  • Trained on two Nvidia* GeForce GTX* 580 GPUs with only 3 GB of memory each
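The optimization recipe above can be sketched on a toy one-dimensional problem. This is a minimal illustration of SGD with momentum plus a manual divide-by-10 schedule, not AlexNet's actual training loop:

```python
# Minimize the toy loss f(w) = (w - 3)^2 with SGD momentum and a manual
# learning-rate schedule that divides the rate by 10 at fixed steps.
w, velocity = 0.0, 0.0
lr, momentum = 0.1, 0.5
for step in range(300):
    if step in (100, 200):      # manual schedule: drop the learning rate
        lr /= 10
    grad = 2 * (w - 3)          # df/dw
    velocity = momentum * velocity - lr * grad
    w += velocity
print(round(w, 6))              # converges to the minimum at w = 3
```

In practice the schedule is driven by the validation accuracy plateauing rather than by fixed step counts, but the update rule is the same.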


ZFNet*

The network architecture of ZFNet, proposed by M. Zeiler and R. Fergus from New York University, is almost identical to AlexNet. The only significant differences are:

  • Filter size and stride in the first convolutional layer (reduced from 11 x 11 stride 4 in AlexNet to 7 x 7 stride 2 in ZFNet)
  • The number of filters in the middle convolutional layers (layers 3, 4, and 5)

Figure 3. ZFNet* architecture.

ZFNet reduced the top-five error rate to 11.4 percent. This improvement came mainly from careful tuning of the hyperparameters (filter size and count, batch size, learning rate, and so on). But perhaps an even more important contribution of ZFNet was its insight into CNNs: Zeiler and Fergus suggested a way of visualizing kernels, weights, and hidden representations of images, called DeconvNet*, which allowed for a better understanding and further development of CNNs.

VGG Net*

In 2014, K. Simonyan and A. Zisserman from the University of Oxford proposed an architecture called VGG. Its main and remarkable idea is to keep the filters as simple as possible: all convolutions use filters of size 3 with stride 1, and all poolings use size 2 with stride 2. But that's not all. Alongside the simplicity of its convolutional units, the network grew dramatically in depth: it has 19 layers! A crucial idea that appeared in this work for the first time was to stack convolutional layers without pooling between them. The intuition is that such stacking still provides a large enough receptive field (for example, three stacked 3 x 3 convolutions with stride 1 have the same receptive field as one 7 x 7 convolutional layer), while the number of parameters is significantly smaller than in networks with big filters (which acts as a kind of regularizer), and additional nonlinearities can be introduced between the layers.
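The parameter savings from stacking small filters can be checked with simple arithmetic (the channel width below is illustrative, and biases are ignored):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 256                              # illustrative channel width
stacked = 3 * conv_params(3, C, C)   # three stacked 3 x 3 layers
single = conv_params(7, C, C)        # one 7 x 7 layer, same receptive field
print(stacked, single)               # 1769472 vs 3211264

# Receptive field of n stacked stride-1 k x k convolutions: 1 + n * (k - 1)
print(1 + 3 * (3 - 1))               # 7, matching a single 7 x 7 layer
```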

Essentially, the authors demonstrated that even with very simple building blocks, one can achieve state-of-the-art quality on ImageNet. The top-five error rate was reduced to 7.3 percent.

Figure 4. VGG* architecture. Note that the number of filters is inversely proportional to the spatial size of the feature maps.


GoogLeNet*

Previously, all progress had been made by simplifying the filters and making the networks deeper. In 2014, C. Szegedy et al. took a completely different path and created the most complex architecture up to that time, called GoogLeNet.

Figure 5. GoogLeNet* architecture. It uses an Inception module which is highlighted in green, and builds the network out of these modules.

One of the main contributions of this work is the so-called Inception module, shown in Figure 6. Whereas other networks process the input sequentially, layer by layer, the Inception module processes its input in parallel branches. This speeds up inference and reduces the total number of parameters.

Figure 6. Inception module. Note that it uses a few parallel branches that compute different features from the same input and then concatenates them.
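The parallel-branch idea can be sketched in a few lines of NumPy. The branches here are stand-in per-pixel linear maps with random weights, not the real 1 x 1 / 3 x 3 / 5 x 5 convolution paths:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 32, 192))             # H x W x C input feature map

def branch(x, c_out):
    # Stand-in for one Inception branch: a linear map applied at each pixel
    w = rng.random((x.shape[-1], c_out))  # hypothetical branch weights
    return x @ w                          # (32, 32, 192) @ (192, c_out)

# All branches see the same input; outputs are concatenated channel-wise
out = np.concatenate(
    [branch(x, 64), branch(x, 128), branch(x, 32), branch(x, 32)], axis=-1
)
print(out.shape)                          # (32, 32, 256): channel counts add up
```

Because every branch preserves the spatial size, only the channel dimension grows when the results are concatenated.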

One more interesting trick used in the Inception module is the 1 x 1 convolution. It might seem meaningless until one recalls that a filter covers the whole depth dimension; thus, a 1 x 1 convolution is simply a way of reducing dimensionality across the channel dimension. This kind of convolution was first introduced in Network In Network, a paper by M. Lin et al., and an exhaustive, intuitive explanation can be found in One by One [ 1 x 1 ] Convolution - counter-intuitively useful, a blog post by A. Prakash.
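The savings from such a 1 x 1 bottleneck are easy to quantify (the channel counts below are illustrative, not taken from the paper, and biases are ignored):

```python
# Parameter cost of a 5 x 5 convolution with and without a 1 x 1 bottleneck
c_in, c_mid, c_out, k = 480, 16, 48, 5
direct = k * k * c_in * c_out                   # 5x5 conv straight: 480 -> 48
reduced = c_in * c_mid + k * k * c_mid * c_out  # 1x1 down to 16, then 5x5
print(direct, reduced)                          # 576000 vs 26880
```

Squeezing the channels first makes the expensive large-filter convolution operate on a much thinner feature map, which is exactly why the Inception module can afford several parallel branches.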

Altogether, this reduced the top-five error rate by roughly another half percent, down to 6.7 percent.


ResNet*

In 2015, a group of researchers (Kaiming He et al.) from Microsoft Research Asia came up with an idea that a large part of the community now considers one of the most important steps in the evolution of deep learning.

One of the main problems with deep networks is the vanishing gradients problem. In brief, this is a technical problem that arises during the back-propagation algorithm of gradient computation. Back propagation uses the chain rule, and if the gradients are small at the head of the network, they may become vanishingly small by the time they reach its beginning, which causes all sorts of problems, including the inability to learn anything (see The vanishing gradient problem, a blog post by R. Kapur, for details).
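A tiny numeric example shows how quickly such a chain-rule product shrinks with depth:

```python
# Back-propagation multiplies local derivatives layer by layer; if each is
# somewhat below 1, the product shrinks geometrically with depth.
grad = 1.0
for _ in range(100):   # a 100-layer chain, local derivative 0.5 at each layer
    grad *= 0.5
print(grad)            # about 8e-31: effectively zero at the early layers
```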

To overcome this issue, Kaiming He et al. suggested the following idea: let the network learn the residual mapping (the part that should be added to the input) instead of the mapping itself. Technically, this is implemented with a skip connection, shown in Figure 7.

Figure 7. Principal scheme of the residual block; the shortcut connection passes the input around the transformation layers and adds it back at the end. Note that the identity connection does not add any parameters to the network, and thus does not complicate it.

This idea is extremely simple yet exceptionally effective. It alleviates the vanishing gradients problem by allowing the gradient to flow unchanged from the top layers to the bottom through the identity connections. As a result, very deep networks can be trained.
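A minimal sketch of the residual block, with a hypothetical transform standing in for the real convolutional layers:

```python
import numpy as np

def residual_block(x, transform):
    """y = F(x) + x: the layers learn only the residual F; the identity
    skip passes x (and, during back-prop, the gradient) through unchanged."""
    return transform(x) + x

x = np.array([1.0, 2.0, 3.0])
# Even when the learned transform is near zero, the block is near-identity,
# so stacking many such blocks cannot easily make the mapping worse.
y = residual_block(x, lambda v: 0.01 * v)
print(y)   # [1.01 2.02 3.03]
```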

The winning network of the ImageNet Challenge 2015 has 152 layers (the authors were able to train a 1,001-layer network, but it gave approximately the same result, so they abandoned it). Moreover, ResNet reduced the top-five error rate by almost a factor of two, down to 3.6 percent. According to What I learned from competing against a ConvNet on ImageNet, a study by A. Karpathy, human performance on this task is around 5 percent. This means that ResNet is able to surpass humans, at least on this image classification task.


Conclusion

In this article, we explored several historical and state-of-the-art CNN architectures, such as AlexNet, GoogLeNet, and ResNet, and reviewed the key ideas underlying each architecture. In the next article, we will apply this knowledge to a practical problem motivated by the app, and build an emotion detection classifier for images.



Prev: Overview of Convolutional Neural Networks for Image Classification
Next: Emotion Recognition from Images: Baseline Model

