# CIFAR-10 Classification using Intel® Optimization for TensorFlow*

Published: 12/13/2017

Last Updated: 05/29/2018

## Abstract

This work describes experiments conducted to train and test the deep learning AlexNet* topology with the Intel® Optimization for TensorFlow* library, using the CIFAR-10 classification dataset on machines powered by Intel® Xeon® Scalable processors. The experiments were run with options set at run time, and the training, validation, and testing accuracies were captured for different environment configurations in order to identify the optimal combination. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.

## Document Content

### Environment Setup

The following hardware and software environments were used to perform the experiments.

#### Hardware

Table 1. Intel® Xeon® Gold 6128 configuration

| Parameter | Value |
| --- | --- |
| Architecture | x86_64 |
| CPU op-mode(s) | 32-bit, 64-bit |
| Byte order | Little endian |
| CPU(s) | 24 |
| Core(s) per socket | 6 |
| Socket(s) | 2 |
| Thread(s) per core | 2 |
| CPU family | 6 |
| Model | 85 |
| Model name | Intel® Xeon® Gold 6128 CPU @ 3.40 GHz |
| RAM | 92 GB |

#### Software

Table 2. On Intel Xeon Gold processor

| Software | Version |
| --- | --- |
| TensorFlow* | 1.4.0 (Intel optimized) |
| Python* | 3.6.3 (Intel distributed) |

The Intel® Optimization for TensorFlow* wheel was installed through pip:

```shell
pip install https://anaconda.org/intel/tensorflow/1.4.0/download/tensorflow-1.4.0-cp36-cp36m-linux_x86_64.whl
```

### Network Topology and Model Training

This section details the dataset used, the AlexNet architecture, and how the model was trained in this work.

#### Dataset

The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The dataset was taken from Kaggle*. The following figure shows a sample set of images for each class.

Figure 1. CIFAR-10 sample images

For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.

#### AlexNet* Architecture

The AlexNet network consists of five convolutional layers (interleaved with max-pooling and dropout layers) and three fully connected layers. The network was originally designed for ImageNet* classification with 1,000 possible categories.

Figure 2. AlexNet architecture (credit: MIT).
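As a quick sanity check on the convolutional layer shapes in the architecture diagram, the standard convolution output-size formula can be evaluated in a few lines. The 227×227 input and 11×11/stride-4 first-layer kernel below are the canonical AlexNet values, assumed here for illustration; they are not specific to this CIFAR-10 variant:

```python
def conv_output_size(input_size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1

# Canonical AlexNet first layer: 227x227 input, 11x11 kernel, stride 4, no padding.
print(conv_output_size(227, 11, 4, 0))  # 55
# Followed by 3x3 max-pooling with stride 2 (same formula applies):
print(conv_output_size(55, 3, 2, 0))    # 27
```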

#### Model Training

In these experiments, it was decided to train the model from scratch using the CIFAR-10 dataset. The dataset was split into 50,000 images for training and validation and 10,000 images for testing.
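The split described above can be sketched with simple index slicing. Note that the 45,000/5,000 train/validation division within the 50,000 training images is an illustrative assumption; the article does not state the exact validation fraction:

```python
num_images = 60000
indices = list(range(num_images))

train_val_idx = indices[:50000]   # 50,000 images for training and validation
test_idx = indices[50000:]        # 10,000 images held out for testing

# Illustrative 90/10 train/validation split (assumed; not specified in the text).
train_idx = train_val_idx[:45000]
val_idx = train_val_idx[45000:]

print(len(train_idx), len(val_idx), len(test_idx))  # 45000 5000 10000
```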

### Experimental Runs

The experiment was conducted in two steps.

#### Step 1

Training was done with different combinations of environment options; the batch size and number of epochs were fixed in this step. The following table details the options chosen and the results for each run. In addition to these settings, the runs were pinned with numactl for NUMA-aware placement on the Intel® Xeon® processors.

| Run | Environment Options | Epochs | Batch Size | Training Accuracy |
| --- | --- | --- | --- | --- |
| 1 | `OMP_NUM_THREADS=6`, `KMP_BLOCKTIME=0`, `KMP_SETTINGS=1`, `KMP_AFFINITY=granularity=fine,verbose,compact,1,0`, `inter_op=1`, `intra_op=6` | 100 | 2048 | 55.20% |
| 2 | As run 1, with `OMP_NUM_THREADS=8`, `intra_op=8` | 100 | 2048 | 55.40% |
| 3 | As run 1, with `OMP_NUM_THREADS=10`, `intra_op=10` | 100 | 2048 | 54.65% |
| 4 | As run 1, with `OMP_NUM_THREADS=12`, `intra_op=12` | 100 | 2048 | 55.56% |
| 5 | As run 1, with `OMP_NUM_THREADS=16`, `intra_op=16` | 100 | 2048 | 55.15% |
| 6 | As run 1, with `OMP_NUM_THREADS=24`, `intra_op=24` | 100 | 2048 | 54.73% |

Training accuracy differed only minimally across the runs. However, setting the OMP_NUM_THREADS and intra_op parameters to 12, 16, or 24 made training roughly 1.4 times faster (refer to Configurations) than setting them to 6, 8, or 10.
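The runtime settings in the table map onto OpenMP/KMP environment variables plus TensorFlow session options. Below is a minimal sketch for the best-performing configuration (run 4); the `tf.ConfigProto` portion matches the TensorFlow 1.x API and is shown as comments so the snippet also runs where TensorFlow is not installed:

```python
import os

# KMP/OMP settings from run 4 (OMP_NUM_THREADS = intra_op = 12).
os.environ["OMP_NUM_THREADS"] = "12"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# The inter_op/intra_op settings go into the TensorFlow 1.x session config:
#   import tensorflow as tf
#   config = tf.ConfigProto(inter_op_parallelism_threads=1,
#                           intra_op_parallelism_threads=12)
#   sess = tf.Session(config=config)

print(os.environ["OMP_NUM_THREADS"])  # 12
```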

#### Step 2

The OMP_NUM_THREADS and intra_op parameters were set to 12, and runs were performed with different batch sizes. The top-1 training, validation, and inference accuracies were captured in each case. Runs were performed for 25, 100, and 1,000 epochs. The results are listed in the following tables.

| Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
| --- | --- | --- | --- | --- |
| 64 | 25 | 70.22% | 68.20% | 68.58% |
| 96 | 25 | 67.67% | 65.74% | 65.43% |
| 128 | 25 | 64.93% | 64.71% | 65.27% |
| 256 | 25 | 58.56% | 57.34% | 57.70% |

Figure 3. Training with 25 epochs.

It was observed that with a larger batch size the quality of the model degrades, because the stochasticity of the gradient descent is reduced. The drop in accuracy is steeper when the batch size increases from 128 to 256. In general, processor performance is better when the batch size is a power of 2. Considering this, it was decided to perform runs with a higher epoch count at batch sizes of 64 and 128.
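The stochasticity argument can be made concrete by counting gradient updates per epoch: with 50,000 training images, a smaller batch yields many more (noisier) parameter updates per pass over the data. A quick illustration:

```python
import math

train_size = 50000  # CIFAR-10 training images used in these experiments

for batch_size in (64, 128, 256, 2048):
    updates_per_epoch = math.ceil(train_size / batch_size)
    print(batch_size, updates_per_epoch)
# batch 64 gives 782 updates/epoch, while batch 2048 gives only 25.
```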

| Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
| --- | --- | --- | --- | --- |
| 64 | 100 | 94.77% | 73.00% | 72.21% |
| 128 | 100 | 89.97% | 71.76% | 71.80% |

Figure 4. Training with 100 epochs.

As the epoch count increased, the network showed improved accuracy. The validation and testing accuracies also matched closely, suggesting the model generalizes well. Continuing the experiments with a larger number of epochs, the top-1 and top-5 training and testing accuracies were observed as follows.

| Sr. No | Batch Size | Top-n Accuracy | Training Accuracy | Testing Accuracy |
| --- | --- | --- | --- | --- |
| 1 | 128 | Top-5 | 100% | 96.89% |
| 2 | 128 | Top-1 | 99.93% | 72.42% |
| 3 | 2048 | Top-5 | 99.76% | 97.29% |
| 4 | 2048 | Top-1 | 91.95% | 68.94% |
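Top-1 and top-5 accuracy as reported above can be computed from per-class scores in a few lines. This is a minimal pure-Python sketch (variable names and the toy data are illustrative, not taken from the experiments):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / len(labels)

# Toy example: 3 samples, 4 classes.
scores = [[0.1, 0.6, 0.2, 0.1],   # top-1 prediction: class 1
          [0.5, 0.1, 0.3, 0.1],   # top-1 prediction: class 0
          [0.2, 0.2, 0.3, 0.3]]   # top-1 prediction: class 2
labels = [1, 2, 3]
print(round(top_k_accuracy(scores, labels, 1), 3))  # 0.333
print(top_k_accuracy(scores, labels, 2))            # 1.0
```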

The following graphs show the top-1 and top-5 training accuracies, recorded every 100 epochs:

Figure 5. Training accuracy comparison (batch size: 128).

Figure 6. Training accuracy comparison (batch size: 2048).

Comparing the top-1 training and testing accuracies, it can be inferred that the network tends to overfit after 500 epochs, likely because the model is repeatedly training on the same data.

## Conclusion

Experiments training the AlexNet topology on Intel® Xeon® Gold processor-powered machines with Intel® Optimization for TensorFlow*, using the CIFAR-10 classification dataset, showed that optimally setting the environment parameters (especially OMP_NUM_THREADS and intra_op) yields better throughput and reduced training times. With the optimal environment parameter configuration, choosing a smaller batch size helped improve training accuracy.

Similar runs can be performed on future releases of Intel® Optimization for TensorFlow* that support distributed mode, to evaluate further performance gains.

Rajeswari Ponnuru, Ajit Kumar Pookalangara, Ravi Keron Nidamarty, and Rishabh Kumar Jain are Technical Consulting Engineers working with the Intel® AI Developer Program.

## Acronyms and Abbreviations

| Term/Acronym | Definition |
| --- | --- |
| CIFAR-10 | Established computer-vision dataset used for object recognition |

## Configurations

For the performance comparisons in the Experimental Runs section:

Hardware: Refer to Hardware under Environment Setup
Software: Refer to Software under Environment Setup
While leaving the other settings the same, the following two settings were changed between the repeated runs:

- Runtime settings 1: `OMP_NUM_THREADS=6`; `intra_op=6`
- Runtime settings 2: `OMP_NUM_THREADS=8`; `intra_op=8`
- Runtime settings 3: `OMP_NUM_THREADS=10`; `intra_op=10`
- Runtime settings 4: `OMP_NUM_THREADS=12`; `intra_op=12`
- Runtime settings 5: `OMP_NUM_THREADS=16`; `intra_op=16`
- Runtime settings 6: `OMP_NUM_THREADS=24`; `intra_op=24`

Test performed: executed the script `~/Tensorflow_Cifar10_Alexnet_Experiments/Tensorflow_Cifar10-V1-Alexnet_flags.py`

Alexnet details