Hands-On AI Part 15: Overview of Convolutional Neural Networks for Image Classification

Published: 10/24/2017  

Last Updated: 10/24/2017

A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

In this article, we provide a comprehensive theoretical overview of convolutional neural networks (CNNs) and explain how they can be used for image classification.

This article is a foundation for the following practical articles, where we will explain how to use CNNs for emotion recognition.

Conventional Approach: No Deep Learning

Image processing refers to the broad range of tasks for which the input is an image and the output might be either an image or a set of characteristics related to it. There are plenty of possible varieties: classification, segmentation, annotation, object detection, and so on. In this article, we explore image classification, not only because it is the simplest of these problems but also because it forms the basis of many other tasks.

The common approach to the image classification problem consists of the following two steps:

  1. Generate meaningful features for an image.
  2. Classify the image based on its features.

The conventional pipeline uses simple models, such as a multilayer perceptron (MLP), support vector machine (SVM), k-nearest neighbors, or logistic regression, on top of handcrafted features. Handcrafted features are generated using different transformations (such as grayscaling and thresholding) and descriptors, such as the histogram of oriented gradients (HOG) or the scale-invariant feature transform (SIFT).
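The two steps above can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: the gradient-based features, the striped test images, and the helper names `handcrafted_features` and `nearest_neighbor_classify` are hypothetical and are not taken from the original pipeline.

```python
import numpy as np

def handcrafted_features(image):
    """Step 1: hand-picked features (mean absolute horizontal and vertical gradients)."""
    dx = np.abs(np.diff(image, axis=1)).mean()
    dy = np.abs(np.diff(image, axis=0)).mean()
    return np.array([dx, dy])

def nearest_neighbor_classify(features, train_features, train_labels):
    """Step 2: a simple 1-nearest-neighbor classifier on top of the features."""
    distances = np.linalg.norm(train_features - features, axis=1)
    return train_labels[int(np.argmin(distances))]

# Toy training data: vertical stripes (label 0) and horizontal stripes (label 1).
vertical = np.tile([0.0, 1.0], (8, 4))   # 8 x 8 image whose columns alternate
horizontal = vertical.T
train_features = np.stack([handcrafted_features(vertical),
                           handcrafted_features(horizontal)])
train_labels = np.array([0, 1])

# Classify a slightly noisy vertical-stripe image.
rng = np.random.default_rng(0)
noisy_vertical = vertical + 0.1 * rng.random(vertical.shape)
prediction = nearest_neighbor_classify(handcrafted_features(noisy_vertical),
                                       train_features, train_labels)
print(prediction)
```

Note that a human chose the gradient features here; a different task would likely need different features.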

The main limitation of the conventional methods is the participation of a human expert who chooses the set and the sequence of steps to generate features.

Over time, people noticed that lots of feature generation techniques could be generalized by means of kernels (filters)—small matrices (the characteristic size is 5 x 5), which convolve with the initial image. Convolution can be thought of as a two-step iterative process (see Figure 1):

  1. Slide the same fixed kernel across the initial image.
  2. At each step, calculate the dot product between the kernel and the initial image at the current position of the kernel.

The result of the image convolution with the kernel is called a feature map.
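The two-step process can be sketched as follows. This is a minimal NumPy illustration, assuming no padding and, as is common in deep learning libraries, no kernel flipping (strictly speaking, cross-correlation). The function name `convolve2d_valid` and the example kernel are chosen here for illustration only.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` and take a dot product at each position."""
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            # Dot product between the kernel and the image patch under it.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
# A simple kernel that responds to horizontal changes in intensity.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)
```

Because the example image increases by 1 from column to column, every entry of the resulting feature map is the same.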

For those who are looking for a more mathematically strict explanation, we suggest reading the corresponding chapter of the recent book, Deep Learning, by I. Goodfellow, Y. Bengio, and A. Courville.

Figure 1. Process of convolving kernel (dark green) with the initial image (green), resulting in a feature map (yellow).

A simple example of a transformation that can be done with filters is blurring (see Figure 2). Let’s take a filter of all 1s, scaled by the number of its elements. It calculates the mean value across the neighborhood determined by the filter. Here, the neighborhood is a square region, but it can be a cross or anything else. Averaging loses information about the exact positions of objects, thereby blurring the whole image. A similar kind of intuition can be found behind other man-made filters.
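A minimal sketch of the blurring filter, assuming NumPy and no padding. The all-ones kernel is divided by its number of elements so that the result is a mean rather than a sum. The function name `box_blur` and the test image are illustrative.

```python
import numpy as np

def box_blur(image, size=3):
    """Replace each pixel with the mean of its size x size neighborhood."""
    kernel = np.ones((size, size)) / (size * size)  # all 1s, normalized
    out_h = image.shape[0] - size + 1
    out_w = image.shape[1] - size + 1
    blurred = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            blurred[i, j] = np.sum(image[i:i + size, j:j + size] * kernel)
    return blurred

# A single bright pixel on a dark background.
image = np.zeros((5, 5))
image[2, 2] = 9.0
blurred = box_blur(image)
print(blurred)
```

The bright pixel is spread evenly over the whole 3 x 3 output, so its exact position can no longer be recovered from the blurred result.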

Figure 2. Results of convolving an image of Harvard with three different kernels.

Convolutional Neural Networks

The conventional approach to image classification has a few considerable drawbacks:

  • Multistage framework as opposed to the end-to-end pipeline.
  • Filters are great generalization tools, but they are fixed matrices. How should one choose which weights to put inside a filter?

Fortunately, people came up with learnable filters, which is the core concept behind CNNs. The idea is simple: Let’s learn which filters we should apply to describe images in the best possible way for this task.

There is no single inventor of CNNs, but one of their earliest appearances is LeNet-5*, introduced in the paper Gradient-Based Learning Applied to Document Recognition by Y. LeCun et al.

CNNs kill two birds with one stone: There is no need to predetermine filters, and the learning procedure becomes end-to-end. The common CNN architecture consists of the following parts:

  • Convolutional layers
  • Pooling layers
  • Dense (fully-connected) layers

Let’s take a closer look at each of them.

Convolutional Layers

The convolutional layer is the main building block of CNNs. The convolutional layer has a set of characteristic properties:

Local (sparse) connectivity. In dense layers, each neuron is connected to all neurons of the previous layer (that’s why they are called dense). In a convolutional layer, each neuron is connected only to a small portion of the previous layer’s neurons.

Figure 3. Example of one-dimensional neural network. (a) How neurons are connected in a typical dense network, (b) Local connectivity property inherent for the convolutional layer. Images are derived from the book, Deep Learning, by I. Goodfellow et al.

The spatial size of the region to which the neuron is connected is called the filter size (filter length in the case of 1D data like time series, and width and height in the case of 2D data like images). In Figure 3b, the filter size is equal to 3. The weights of these connections are called the filter (a vector in the case of 1D data and a matrix for 2D). Stride is the size of the step with which we slide the filter over the data (stride is equal to 1 in Figure 3b). Local connectivity is nothing more than the sliding kernel: each neuron in the convolutional layer implements one particular position of the kernel sliding across the initial image.
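The roles of filter size and stride can be illustrated with a small 1D sketch. It assumes no padding, in which case the number of output neurons is (input length - filter size) // stride + 1. The function name `conv1d` and the sample values are illustrative.

```python
import numpy as np

def conv1d(signal, filt, stride=1):
    """Each output neuron sees only `len(filt)` neighboring input values."""
    filter_size = len(filt)
    n_out = (len(signal) - filter_size) // stride + 1
    return np.array([
        np.dot(signal[k * stride:k * stride + filter_size], filt)
        for k in range(n_out)
    ])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
filt = np.array([1.0, 0.0, -1.0])            # filter size 3

out_stride1 = conv1d(signal, filt, stride=1)
out_stride2 = conv1d(signal, filt, stride=2)
print(len(out_stride1))  # 5
print(len(out_stride2))  # 3
```

A larger stride skips positions of the kernel, so the layer has fewer neurons.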

Figure 3 (c). Two stacked 1D convolutional layers.

There is one more important notion called the receptive field. It reflects how many positions in the initial signal can be seen from the current neuron. For example, the receptive field in the first layer of the network shown in Figure 3 (c) equals the filter size, 3, because each neuron is connected to only three neurons of the initial signal. In the second layer the receptive field already equals 5, because a second-layer neuron aggregates three neighboring neurons of the first layer, each of which has a receptive field of 3, and these fields overlap. With a stride of 1, the receptive field keeps growing linearly with depth.
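The growth of the receptive field can be checked with a short calculation. The sketch below uses the standard recurrence in which each layer adds (filter size - 1) times the product of the strides of all earlier layers. The function name `receptive_field` is illustrative.

```python
def receptive_field(layers):
    """Receptive field after each layer.

    layers: list of (filter_size, stride) tuples, first layer first.
    """
    rf = 1     # an input position sees only itself
    jump = 1   # distance, in input positions, between neighboring neurons
    sizes = []
    for filter_size, stride in layers:
        rf += (filter_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Stacked layers with filter size 3 and stride 1.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]

# With stride 2 the receptive field grows faster than linearly.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # [3, 7, 15]
```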

Parameter sharing. Recall that in classical image processing we slid the same kernel across the whole image. The same idea applies here. Let’s keep only as many weights per layer as the filter size, and share these weights across all the neurons in the layer. This corresponds to sliding one kernel across the whole image. But then, one could argue, how can we learn anything with such a small number of parameters?

Figure 4. Dark arrows represent the same weights. (a) Shows the usual MLP, where each weight is a separate parameter, (b) Illustrates the notion of parameter sharing, where many weights refer to only one learnable parameter.
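One way to see local connectivity and parameter sharing together is to write a 1D convolutional layer as a dense weight matrix. In the sketch below (an illustration with arbitrary filter values, stride 1, and no padding), most entries of the matrix are zero and every row reuses the same three weights.

```python
import numpy as np

n_in, filter_size = 7, 3
n_out = n_in - filter_size + 1            # 5 neurons
filt = np.array([0.5, -1.0, 2.0])         # the only learnable weights of the layer

# Equivalent dense weight matrix:
# zeros outside the window (local connectivity),
# the same three values in every row (parameter sharing).
W = np.zeros((n_out, n_in))
for k in range(n_out):
    W[k, k:k + filter_size] = filt

x = np.arange(n_in, dtype=float)
dense_style = W @ x
conv_style = np.array([np.dot(x[k:k + filter_size], filt) for k in range(n_out)])

print(np.allclose(dense_style, conv_style))  # True
print(W.size, filt.size)                     # 35 3
```

A general dense layer of the same shape would have 35 independent weights, while the convolutional layer has only 3.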

Spatial arrangement. The answer to the question is simple—let’s learn multiple filters in one layer. They will be placed in parallel to each other, therefore forming a new dimension.

Let’s slow down for a moment and consider a 2D example of a 227 x 227 RGB image to explain all the introduced concepts. Notice that we now have a three-channel input image, which in fact means that we have three input images, or 3D input.

Figure 5. Spatial dimensions of the input image.

Let’s refer to the channel dimension as the depth of the image (notice that this differs from the depth of a neural network, which is equal to the number of layers in it). The question is how to define the kernel in this case.

Figure 6. Example of 2D kernel, which in fact is a 3D matrix extending the depth dimension. This filter convolves with the image; that is, slides over the image spatially computing dot products.

The answer is simple yet not obvious—let’s also make our kernel three-dimensional. The first two dimensions are exactly the same as before (width and height of the kernel), while the third dimension should always be equal to the depth of the input.

Figure 7. Example of convolution spatial step. The result of a dot product between the filter and a small 5 x 5 x 3 chunk of the image (that is, a 5*5*3 = 75-dimensional dot product + bias, or 76 parameters in total) is one number.

In this case, the whole 5 x 5 x 3 region of the initial image is mapped into one number, while the 3D image itself is mapped into a feature (activation) map. A feature map is a set of neurons, each calculating its own function, subject to the two main principles discussed above: local connectivity (each neuron is connected only to a small portion of the input data) and parameter sharing (all neurons use the same filter). Conceptually, this feature map is the same as the one we saw in the conventional example: it stores the result of convolving the input image with the filter.
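The mapping of one 5 x 5 x 3 chunk to one number, and of the whole image to a feature map, can be sketched as follows. The image and kernel hold random stand-in values, the bias is arbitrary, and stride 1 with no padding is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((227, 227, 3))   # height x width x depth (RGB), random stand-in data
kernel = rng.random((5, 5, 3))      # kernel depth equals the input depth
bias = 0.1                          # arbitrary illustrative value

# One spatial step: the 5 x 5 x 3 chunk in the top-left corner becomes one number.
chunk = image[0:5, 0:5, :]
activation = np.sum(chunk * kernel) + bias

# Repeating the step at every position (stride 1, no padding) gives the feature map.
out_size = image.shape[0] - kernel.shape[0] + 1
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        feature_map[i, j] = np.sum(image[i:i + 5, j:j + 5, :] * kernel) + bias

print(kernel.size + 1)      # 76 (75 weights plus one bias)
print(feature_map.shape)    # (223, 223)
```

All 223 x 223 neurons of the feature map share the same 76 parameters.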

Figure 8. Feature map as a result of convolving kernel with image over all spatial locations.

Note that the depth of the feature map equals 1 because we used only one filter. Nothing can stop us from using more filters; for example, 6. All of them will be looking at the same input data and will act independently of each other. Let’s go one step further and stack these feature maps. All of them have equal spatial dimensions because the filters are of the same size. Therefore, these stacked feature maps can be thought of as a new 3D array of data, where the depth dimension is represented by feature maps from different kernels. In this view, the RGB channels of the input image are no more than three initial feature maps.
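A sketch of a layer with several filters, assuming stride 1 and no padding. To keep the loops fast, a 32 x 32 x 3 random array stands in for the 227 x 227 x 3 image. The function name `conv_layer` and the second layer with 10 filters of size 3 x 3 are illustrative choices.

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """volume: (H, W, D); filters: (n_filters, k, k, D).

    Returns stacked feature maps of shape (H - k + 1, W - k + 1, n_filters).
    """
    h, w, d = volume.shape
    n_filters, k, _, filter_depth = filters.shape
    assert filter_depth == d, "kernel depth must equal input depth"
    out = np.zeros((h - k + 1, w - k + 1, n_filters))
    for f in range(n_filters):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[i, j, f] = np.sum(volume[i:i + k, j:j + k, :] * filters[f]) + biases[f]
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # small stand-in for the RGB input

# Six 5 x 5 x 3 filters produce six feature maps, stacked along the depth.
filters1 = rng.random((6, 5, 5, 3))
maps1 = conv_layer(image, filters1, np.zeros(6))
print(maps1.shape)                     # (28, 28, 6)
print(filters1.size + 6)               # 456 parameters: 6 * (5*5*3 + 1)

# The stacked maps are a new 3D volume, so another layer can consume them.
filters2 = rng.random((10, 3, 3, 6))   # kernel depth is now 6
maps2 = conv_layer(maps1, filters2, np.zeros(10))
print(maps2.shape)                     # (26, 26, 10)
```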

Figure 9. Multiple filters applied to an input image in parallel result in multiple activation maps.

Understanding feature maps and their stacking is crucial, because once we do, we can extend the architecture of our network and stack convolutional layers on top of each other to increase the receptive field and make the classifier richer.

Figure 10. Stacked convolutional layers. Filter sizes and their number can vary from layer to layer.

Now we understand what the convolutional layer is. The main goal of these layers is the same as it was in the conventional approach—detect meaningful features from the image. And while these features can be very simple in the first layers (presence of vertical and horizontal lines), their abstractness grows with the depth of the network (presence of dog or cat or human).

Pooling Layers

While the convolutional layer is the main building block of the CNNs, there is one more important part which is often used—the pooling layer. There is no direct analogy with conventional image processing, but pooling can be seen just as a different type of kernel. So what is it?

Figure 11. Examples of pooling. (a) Depicts how pooling changes the spatial (but not channel) dimensions of the data volumes, (b) Illustrates the principal scheme of the pooling operation.

Pooling filters the neighborhood region of each input pixel with some predefined aggregation function such as maximum, average, and so on. Pooling is effectively the same as convolution, but the pixel combination function is not restricted to the dot product. One more crucial difference is that pooling acts only in the spatial dimensions: each channel is processed independently, so the depth is unchanged. The characteristic feature of the pooling layer is that the stride usually equals the filter size (a common value is 2).
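A minimal sketch of pooling with a 2 x 2 window and stride 2, applied independently to every channel. The function name `pool2d` and its `reduce_fn` parameter are illustrative, and any NumPy reduction that accepts an `axis` tuple can be passed in.

```python
import numpy as np

def pool2d(volume, size=2, stride=2, reduce_fn=np.max):
    """Pool an (H, W, D) volume spatially; the depth D is left unchanged."""
    h, w, d = volume.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i * stride:i * stride + size,
                            j * stride:j * stride + size, :]
            # Aggregate over the two spatial axes only.
            out[i, j, :] = reduce_fn(window, axis=(0, 1))
    return out

volume = np.arange(32, dtype=float).reshape(4, 4, 2)
max_pooled = pool2d(volume)                      # max pooling
avg_pooled = pool2d(volume, reduce_fn=np.mean)   # average pooling

print(volume.shape, "->", max_pooled.shape)      # (4, 4, 2) -> (2, 2, 2)
print(max_pooled[:, :, 0])
```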

There are three main goals for pooling:

  • Spatial dimensionality reduction or downsampling. This is done to reduce the amount of computation and the number of parameters in the following layers (the pooling layer itself has no learnable parameters).
  • Receptive field growth. Because each pooled neuron summarizes several input positions, neurons in the following layers cover a larger portion of the input signal.
  • Translational invariance to small perturbations in the positions of patterns in the input signal. By taking aggregate statistics over small neighborhoods of the input signal, pooling can ignore small spatial shifts within it.
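Downsampling and invariance to small shifts can be seen in a tiny 1D example. This is only a sketch: the invariance holds here because the pattern moves within one pooling window, and a shift that crosses a window boundary would change the output. The function name `max_pool_1d` is illustrative.

```python
import numpy as np

def max_pool_1d(signal, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    trimmed = signal[:len(signal) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

original = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
shifted = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0])   # same pattern, moved by one position

pooled_original = max_pool_1d(original)
pooled_shifted = max_pool_1d(shifted)

print(pooled_original)  # [0. 5. 0.]
print(pooled_shifted)   # [0. 5. 0.]
```

The six input positions are reduced to three, and both inputs produce the same pooled signal.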

Dense Layers

Convolutional and pooling layers serve one goal—generate features from the image. The final step is to classify the input image based on the detected features. In CNNs it is done with dense layers on top of the network. This is called the classification part. It may contain several stacked, fully-connected layers, but it usually ends up with the softmax activated layer, where the number of units is equal to the number of classes. The softmax layer outputs the probability distribution over the classes for the input object. Then, one can classify the image by choosing the most probable class.
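The final classification step can be sketched as a single dense layer followed by softmax. The feature vector, weights, and the choice of three classes below are random or arbitrary stand-ins, not values from a trained network.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over classes."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
features = rng.random(8)                # stand-in for flattened conv/pool features
n_classes = 3
W = rng.normal(size=(n_classes, features.size))   # dense layer weights (untrained)
b = np.zeros(n_classes)

logits = W @ features + b               # one unit per class
probabilities = softmax(logits)
predicted_class = int(np.argmax(probabilities))   # the most probable class

print(probabilities, probabilities.sum())
print(predicted_class)
```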


In this article we introduced one of the most powerful classes of deep learning models—convolutional neural networks. We gave an overview of key concepts such as convolution, filter, feature map, stride, receptive field, and so on, as well as the intuition behind the CNNs.

In the next article, we will review the main CNN architectures for image recognition and highlight the key contributions of each.


