A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers
In the previous articles of this tutorial series, we described how data can be prepared for convolutional neural networks (CNNs) (see Image Data Preprocessing and Augmentation) and built a simple CNN baseline model (see Emotion Recognition from Images Baseline Model).
In this article, we will build an advanced CNN model for emotion recognition from images using the technique called transfer learning.
Please read the baseline article, Emotion Recognition from Images Baseline Model, first, or refer to it while reading, because some parts, including the data investigation and the description of the metrics, won’t be reproduced here in detail.
All the code, notebooks, and other materials, including the Dockerfile*, can be found on GitHub*.
The dataset of emotional images contains 1630 images of two classes: Negative (class 0) and Positive (class 1). Some of the image examples can be seen below.
While some of the examples obviously have positive or negative emotion, others might be misclassified, even by humans. Based on the visual inspection of such cases we estimate that the maximum possible accuracy should be around 80 percent. Note that the random classifier gives around 53 percent due to a little imbalance in the classes.
For training purposes we use a hold-out sample approach and set aside 20 percent of the dataset for validation. The split is done with stratification, which means that the balance between classes is the same in the training and validation sets. For details, see the baseline article, Emotion Recognition from Images Baseline Model.
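To make the idea of stratification concrete, here is a minimal sketch in plain Python. The helper name `stratified_split` and the class counts are assumptions made for illustration; they are not taken from the series' code, which may well rely on a library routine such as scikit-learn's `train_test_split` with its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class separately.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels with a mild imbalance (hypothetical counts, not the real dataset).
labels = [0] * 470 + [1] * 530
train_idx, val_idx = stratified_split(labels)

val_positive_share = sum(labels[i] for i in val_idx) / len(val_idx)
train_positive_share = sum(labels[i] for i in train_idx) / len(train_idx)
```

Because each class is sampled separately, both shares equal the share of the positive class in the whole toy set.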
How to Tackle Insufficient Data
The baseline model has shown itself to be only slightly better than the random guessing of the class. There might be many possible reasons for such behavior. We believe the main reason is that the amount of data is drastically insufficient to be able to teach the convolutional part of the network to extract meaningful feature representation from the input image.
There are many different ways of tackling the problem of insufficient data. Here are a few of them:
- Resampling. The idea is to estimate the data distribution and then to sample new examples from this distribution.
- Unsupervised learning. One can find huge amounts of data of the same nature as the labeled examples in the given dataset. For instance, it can be films for video recognition or audio books for speech recognition. The next step is to use this data somehow to pretrain the model (for example, using autoencoders).
- Data augmentation. It randomly alters data examples with a predefined set of transformations. For details, see our article on preprocessing and data augmentation, Image Data Preprocessing and Augmentation.
- Transfer learning. This is an option of interest so let’s take a look at it in detail.
Transfer learning refers to the set of techniques that use models (often very large) trained on different datasets of approximately the same nature.
Comparison of traditional machine learning methods and transfer learning. Image is taken from the blog post, What is Transfer Learning? by S. Ruder.
There are three main scenarios of transfer learning usage:
- Pretrained models. One just takes the model trained by someone else and uses it in the task of interest. This is possible if the tasks are very similar.
- Feature extractor. By now we know that the model architecture can be divided into two main parts: the feature extractor, which is responsible for extracting features from the input data, and the classification head, which classifies the examples based on the extracted features. The feature extractor usually constitutes the dominant part of the model. The idea is to take the feature extractor part from a model trained on a different task, fix its weights (make them non-trainable), and build a new classification part for the task of interest on top of it. The classification part is usually not very deep and consists of a few dense layers; thus, such a model is much easier to train.
- Deep fine-tuning. This method resembles the feature extractor scenario. We do exactly the same things except that the feature extractor part is not frozen completely. For example, one might consider taking the VGG* network as a feature extractor and freezing only the first three (out of five) convolutional blocks. In this case the feature extractor might better adapt to the current problem. See Building powerful image classification models using very little data, a blog post by F. Chollet, for details.
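The difference between the last two scenarios comes down to which layers are marked as non-trainable. Here is a minimal sketch, assuming layer objects that expose `name` and `trainable` attributes the way Keras layers do. The helper `freeze_blocks` is hypothetical, and the stand-in layers only imitate the naming scheme that Keras uses for VGG16 (`block1_conv1` through `block5_conv3`).

```python
from types import SimpleNamespace


def freeze_blocks(layers, frozen_prefixes):
    """Freeze layers whose name starts with one of the given prefixes.

    Returns the names of the layers that remain trainable.
    """
    prefixes = tuple(frozen_prefixes)
    for layer in layers:
        layer.trainable = not layer.name.startswith(prefixes)
    return [layer.name for layer in layers if layer.trainable]


# Stand-in layer objects with VGG16-style names (2, 2, 3, 3, 3 conv layers).
names = [
    f"block{block}_conv{conv}"
    for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1)
    for conv in range(1, n_convs + 1)
]
layers = [SimpleNamespace(name=name, trainable=True) for name in names]

# Feature extractor scenario: every convolutional block is frozen.
trainable_fe = freeze_blocks(layers, [f"block{b}_" for b in range(1, 6)])

# Deep fine-tuning scenario: only the first three blocks stay frozen.
trainable_ft = freeze_blocks(layers, ["block1_", "block2_", "block3_"])
```

With a real Keras model one would pass `model.layers` instead of the stand-ins and set the flags before compiling the model.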
Details on the scenarios of transfer learning usage can be found in CS231n Convolutional Neural Networks for Visual Recognition, a Stanford course on CNNs by Fei-Fei Li and in Transfer Learning - Machine Learning's Next Frontier, a blog post by S. Ruder (more comprehensive).
One may wonder what the reasons are for doing this and why it might work:
- Benefit from using big datasets. For instance, we can take a feature extractor part from the model trained on 14 million images in the ImageNet* dataset. These models are complex enough to extract very good features from the data.
- Time considerations. Training big models can take as long as weeks or even months. One can save an enormous amount of time and computational resources in this case.
- The strong hypothesis behind why it all might work: Features that are learned through one task might be useful and appropriate for another task. In other words, features have the property of invariance with regard to the task. Note that the domain of the new task should be similar to the domain of the initial task. Otherwise the feature extractor might even worsen the results.
Advanced Model Architecture
Now we know what transfer learning is. And we also know that ImageNet is a huge challenge where almost all state-of-the-art architectures were tested. Let’s take a feature extractor part from one of these networks.
Fortunately, Keras* provides us with a few models pretrained on ImageNet that are built into the framework. Let’s import and use one of them.
In this case we use a VGG network. To take only the feature extractor part, let’s cut off the classification head (three top, dense layers) of the network by setting the “include_top” parameter to “False”. We also want to initialize our network with the weights from the network trained on ImageNet. The final option is the size of the input.
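A minimal sketch of these three options follows. It assumes the VGG16 variant (the text only says "VGG") and a recent Keras version; the wrapper function `build_feature_extractor` is a name introduced here for illustration.

```python
from keras.applications import VGG16


def build_feature_extractor(weights="imagenet", input_shape=(400, 500, 3)):
    """VGG16 without its dense classification head, with frozen weights.

    VGG16 is an assumption; the same arguments exist for VGG19.
    """
    model = VGG16(
        include_top=False,        # cut off the three top dense layers
        weights=weights,          # "imagenet" loads the pretrained weights
        input_shape=input_shape,  # (height, width, channels) of our images
    )
    model.trainable = False       # fix the weights (make them non-trainable)
    return model
```

Calling `build_feature_extractor()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random weights, which is handy for checking shapes.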
Note that the original input size in the ImageNet contest is (224, 224, 3), while our images are (400, 500, 3). But we use convolutional layers, which means that the weights of the network are the weights of the sliding kernels in the convolution operation. In combination with the parameter sharing property (discussed in our theoretical Overview of Convolutional Neural Networks for Image Classification article), this means that the input size can be almost arbitrary, because the convolution is done by means of a sliding window that can slide across an image of any size. The only restriction is that the input should be big enough not to collapse into a single point (in the spatial dimensions) at some intermediate layer, because otherwise it would be impossible to perform further computations.
One more trick that we use is caching. VGG is a very big network. One forward pass of all the images (1630 examples) through the feature extractor part takes about 50 seconds. But recall that the weights of the feature extractor part are fixed, so a forward pass always gives the same result for the same image. We can use this fact and do the forward pass through the feature extractor only once and then cache the results in an intermediate array. To do that, let’s first create an ImageDataGenerator class instance for streaming files directly from the hard disk (see the baseline article, Emotion Recognition from Images Baseline Model, for details).
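The size restriction can be checked with simple arithmetic. The sketch below assumes the standard VGG layout, where the 3 x 3 convolutions preserve the spatial size and only the five 2 x 2 max-pooling layers halve it (rounding down); the function name is made up for this example.

```python
def vgg_output_size(height, width, n_pools=5):
    """Spatial size of the feature maps after n_pools halvings."""
    for _ in range(n_pools):
        height, width = height // 2, width // 2
    return height, width


print(vgg_output_size(224, 224))  # (7, 7)
print(vgg_output_size(400, 500))  # (12, 15)
print(vgg_output_size(16, 16))    # (0, 0): the input is too small
```

So our (400, 500, 3) images produce 12 x 15 feature maps, while a 16 x 16 input would collapse before reaching the last layer.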
The next step is to use the previously created feature extractor model in prediction mode to get the features.
That took about 50 seconds. Now we can use it for training the top classification part of the model extremely fast—about 1 second per epoch. Imagine that, otherwise, every epoch would be 50 seconds longer. By using this simple caching trick we speed up the training procedure by 50 times! Here we store all the features for all the examples in RAM just because they fit into RAM. In the case of a bigger dataset one can calculate features, write them to the hard disk, and then read them using the same generator approach.
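The caching idea itself does not depend on the framework. Here is a minimal NumPy sketch in which `toy_extractor` is a hypothetical stand-in for the frozen network; with Keras, the `predict` method of the frozen model could play that role.

```python
import os
import tempfile

import numpy as np


def cache_features(extract_batch, batches):
    """Run a frozen feature extractor once and keep all outputs in RAM."""
    return np.concatenate([extract_batch(batch) for batch in batches], axis=0)


def toy_extractor(images):
    # Hypothetical stand-in: any deterministic function of the input
    # behaves like a network with fixed weights as far as caching goes.
    return images.mean(axis=3, keepdims=True)


rng = np.random.default_rng(0)
batches = [rng.random((4, 8, 10, 3)) for _ in range(3)]  # 12 tiny fake images

features = cache_features(toy_extractor, batches)  # the one expensive pass

# For a bigger dataset the cached array can be written to disk and read back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.npy")
    np.save(path, features)
    restored = np.load(path)
```

Every training epoch can then read `features` directly instead of repeating the forward pass.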
Finally, let’s take a look at the classification part architecture:
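A head of this kind might look like the following sketch. The number of hidden units and the dropout rate are assumptions chosen for illustration, as is the helper name `build_head`; only the layer types (global average pooling, dense, dropout) come from the text.

```python
from keras import Input
from keras.layers import Dense, Dropout, GlobalAveragePooling2D
from keras.models import Sequential


def build_head(feature_shape=(12, 15, 512), hidden_units=256, dropout=0.5):
    """Small classifier trained on top of the cached VGG feature maps."""
    return Sequential([
        Input(shape=feature_shape),
        GlobalAveragePooling2D(),                # (12, 15, 512) -> (512,)
        Dense(hidden_units, activation="relu"),  # assumed hidden layer size
        Dropout(dropout),                        # assumed dropout rate
        Dense(1, activation="sigmoid"),          # probability of class 1
    ])


head = build_head()
```

With these assumed sizes the head has roughly 130 thousand trainable parameters, which is tiny compared with the convolutional part.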
Recall that the output of the convolutional feature extractor part is a 4D tensor with dimensions (examples, height, width, channels), while a dense classification layer accepts a 2D tensor with dimensions (examples, features). One way to transform the 4D feature tensor is to flatten it along the last three dimensions (we did this in the baseline model). Here we use a different method called global average pooling (GAP). Instead of flattening the 4D tensor, we take the average value across the two spatial dimensions. In fact, we take each feature map and average all the values in it. GAP was first introduced in Network In Network, a great paper by Min Lin et al. (it’s indeed worth looking through as it introduces a few important concepts; for example, 1 x 1 convolutions). One obvious advantage of GAP is a considerably smaller number of parameters. With GAP we have only 512 features per example, while with raw flattening we would have 15 * 12 * 512 = 92160. That would be overkill, because in this case the classification part would have around 50 million parameters! Other parts of the classification model, such as the dense and dropout layers, are described in detail in the baseline article, Emotion Recognition from Images Baseline Model.
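The two ways of turning the 4D tensor into a 2D one can be compared directly in NumPy. The random array below is only a stand-in with the same shape as the extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((2, 12, 15, 512))  # (examples, height, width, channels)

flattened = features.reshape(len(features), -1)  # keep every spatial position
pooled = features.mean(axis=(1, 2))              # average each feature map

print(flattened.shape)  # (2, 92160)
print(pooled.shape)     # (2, 512)
```

Each of the 512 pooled values is simply the mean of one 12 x 15 feature map.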
Training Setting and Parameters
Once we have our architecture prepared with Keras, we need to configure the whole model for training using the compile method.
We use almost the same setting as for the baseline model except for the optimizer. Binary cross entropy is the loss to optimize during training, and accuracy is the additional metric to trace. Adam* is our optimizer choice. It is a kind of stochastic gradient descent algorithm with momentum and adaptive learning rate (see An overview of gradient descent optimization algorithms, a blog post by S. Ruder, for details).
Learning rate is a hyperparameter of the optimizer that one might need to tune to make the model work. Recall how the formula for the vanilla gradient descent looks:
$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$$
Here θ is the vector of the model parameters (weights of the neural network in our case), L is the objective function, ∇ is the gradient operator (calculated via the back-propagation algorithm), and α is the learning rate. Therefore, the negative gradient of the objective function is the direction of the optimization step in the parameter space, while the learning rate determines its size. With an unnecessarily large learning rate it’s possible to constantly overshoot the optimal point due to the big step size. On the other hand, if the learning rate is too small, the optimization would take too long and might converge only to a bad local minimum instead of the global one; thus, one needs to find a tradeoff. The default setting for Adam is often a good place to start.
But in this problem the default setting for Adam does not work well. We need to reduce the initial learning rate down to 0.0001; otherwise, the training does not converge.
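The effect of the step size is easy to see on a toy problem. The sketch below applies vanilla gradient descent (not Adam) to the one-dimensional objective L(theta) = theta², whose gradient is 2 * theta; the three learning rates are picked only to illustrate the regimes and are unrelated to the values used for our model.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Vanilla gradient descent: theta <- theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad(theta):
    return 2.0 * theta  # gradient of L(theta) = theta ** 2, minimum at 0


too_large = gradient_descent(grad, theta=1.0, alpha=1.1, steps=50)
too_small = gradient_descent(grad, theta=1.0, alpha=0.001, steps=50)
balanced = gradient_descent(grad, theta=1.0, alpha=0.3, steps=50)
```

After 50 steps the large rate has overshot the minimum on every step and moved far away from it, the small rate has barely left the starting point, and the intermediate rate sits practically at the minimum.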
Finally, we can start our training for 100 epochs and save the model itself and the training history after that. %time is an IPython* magic command that measures the execution time of the code.
Let’s check the performance of the model during training. Here, the validation accuracy equals 73 percent, which is much better than the 55 percent accuracy of the baseline model.
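Put together, the configuration and training steps might look like the sketch below. It assumes a recent Keras version; the random arrays stand in for the cached features and labels, the head is reduced to a single dense layer, and only two epochs are run to keep the example short.

```python
import numpy as np
from keras import Input
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Sequential
from keras.optimizers import Adam

# Random stand-ins for the cached features and labels (not real data).
rng = np.random.default_rng(0)
features = rng.random((32, 12, 15, 512)).astype("float32")
labels = rng.integers(0, 2, size=32)

# Deliberately minimal head; the real one also has hidden and dropout layers.
model = Sequential([
    Input(shape=features.shape[1:]),
    GlobalAveragePooling2D(),
    Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",          # loss to optimize
    optimizer=Adam(learning_rate=1e-4),  # reduced from Adam's default of 1e-3
    metrics=["accuracy"],                # additional metric to trace
)

# The article trains for 100 epochs; 2 are enough for a smoke test.
history = model.fit(features, labels, epochs=2, batch_size=16, verbose=0)
```

The returned `history.history` dictionary holds the per-epoch loss and metric values, which is the training history that gets saved afterward.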
Let’s also look at the error distribution by means of a confusion matrix. Errors are distributed almost uniformly between classes with a small bias toward false negatives (top-left cell of the confusion matrix). It can be explained by a little imbalance in the dataset toward the positive class.
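For reference, a confusion matrix for two classes can be computed in a few lines. The layout used here (rows are true classes, columns are predicted classes) is an assumption that follows the common scikit-learn convention and may differ from the arrangement of the figure; the labels and predictions are made up.

```python
def confusion_matrix(y_true, y_pred):
    """2 x 2 matrix with rows = true class and columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


# Hypothetical labels and predictions, only to show the layout.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
```

In this toy example there are two false negatives and one false positive, so the errors lean toward the more frequent positive class.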
The next metric we check is the receiver operating characteristic (ROC) curve and the area under the curve (AUC). For a detailed explanation of what it is, see the Emotion Recognition from Images Baseline Model article.
The closer the ROC curve is to the top-left corner and the bigger the AUC, the better the classifier is. In this image it can be clearly seen that the advanced pretrained model performs better than the baseline model from scratch. The AUC of the pretrained model equals 0.82, which is a good result.
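One way to build intuition for the AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up scores; it is a quadratic-time illustration, not the routine used to produce the plot.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        (p > n) + 0.5 * (p == n)  # a tie counts as half a correct ranking
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, hence 0.75; a classifier that ranks at random scores about 0.5.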
In this article we learned a powerful technique, transfer learning. We also built a convolutional neural network classifier using a pretrained VGG feature extractor. This classifier has outperformed the baseline convolutional model trained from scratch by 18 percentage points in accuracy and 0.25 in AUC, which is a very significant boost in quality.