CIFAR-10 Classification using Intel® Optimization for TensorFlow*

Published: 12/13/2017  

Last Updated: 05/29/2018

Abstract

This work describes the experiments conducted to train and test the AlexNet* deep learning topology with the Intel® Optimization for TensorFlow* library, using the CIFAR-10 classification dataset on machines powered by Intel® Xeon® Scalable processors. The experiments were run with different environment options set at run time, and the training, validation, and testing accuracies were captured for each configuration to identify the optimal combination. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.

Document Content

Environment Setup

The following hardware and software environments were used to perform the experiments.

Hardware

Table 1. Intel® Xeon® Gold 6128 configuration

Architecture x86_64
CPU op-mode(s) 32 bit, 64 bit
Byte order Little endian
CPU(s) 24
Core(s) per socket 6
Socket(s) 2
Thread(s) per core 2
CPU family 6
Model 85
Model name Intel® Xeon® Gold 6128 CPU @ 3.40 GHz
RAM 92 GB

Software

Table 2. Software configuration on the Intel® Xeon® Gold processor

TensorFlow* 1.4.0 (Intel optimized)
Python* 3.6.3 (Intel distributed)

The Intel® Optimization for TensorFlow* wheel was installed through pip:

pip install https://anaconda.org/intel/tensorflow/1.4.0/download/tensorflow-1.4.0-cp36-cp36m-linux_x86_64.whl
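
Once installed, the wheel can be sanity-checked from a shell by importing TensorFlow and printing its version:

    python -c "import tensorflow as tf; print(tf.__version__)"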

Network Topology and Model Training

This section describes the dataset used, the AlexNet architecture, and how the model was trained in this work.

Dataset

The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The dataset was taken from Kaggle*3. The following figure shows a sample set of images for each classification.

Figure 1. CIFAR-10 sample images

For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.
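
The experiments use the data as downloaded from Kaggle, and the exact input pipeline of the experiment script is not reproduced here. As a minimal illustration, the sketch below loads the standard CIFAR-10 "python version" batch files (data_batch_1 through data_batch_5 and test_batch); the file paths and format are assumptions, not necessarily those used in the experiments.

    import pickle
    import numpy as np

    def load_batch(path):
        # Each CIFAR-10 python batch is a pickled dict of raw pixel rows and labels.
        with open(path, "rb") as f:
            batch = pickle.load(f, encoding="bytes")
        images = batch[b"data"].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # to NHWC
        labels = np.asarray(batch[b"labels"], dtype=np.int64)
        return images.astype(np.float32) / 255.0, labels

    # Five batches of 10,000 training images plus 10,000 test images.
    parts = [load_batch("cifar-10-batches-py/data_batch_%d" % i) for i in range(1, 6)]
    train_x = np.concatenate([p[0] for p in parts])   # shape (50000, 32, 32, 3)
    train_y = np.concatenate([p[1] for p in parts])
    test_x, test_y = load_batch("cifar-10-batches-py/test_batch")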

AlexNet* Architecture

The AlexNet network consists of five convolutional layers, max-pooling layers, dropout layers, and three fully connected layers. The network was originally designed for classification with 1,000 possible categories.

Figure 2. AlexNet architecture (credit: MIT2).
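
As an illustration of the topology, a scaled-down AlexNet-style graph for 32 x 32 CIFAR-10 inputs can be written with the TensorFlow 1.x layers API as below. The kernel sizes and filter counts are assumptions chosen for the small input resolution (the original AlexNet operates on much larger ImageNet images) and are not necessarily those of the experiment script.

    import tensorflow as tf

    def alexnet(images, num_classes=10, training=False):
        # Five convolutional layers interleaved with max pooling.
        x = tf.layers.conv2d(images, 64, 3, padding="same", activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 2, 2)
        x = tf.layers.conv2d(x, 192, 3, padding="same", activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 2, 2)
        x = tf.layers.conv2d(x, 384, 3, padding="same", activation=tf.nn.relu)
        x = tf.layers.conv2d(x, 256, 3, padding="same", activation=tf.nn.relu)
        x = tf.layers.conv2d(x, 256, 3, padding="same", activation=tf.nn.relu)
        x = tf.layers.max_pooling2d(x, 2, 2)
        # Three fully connected layers with dropout; the last emits one logit per class.
        x = tf.reshape(x, [-1, 4 * 4 * 256])
        x = tf.layers.dense(x, 4096, activation=tf.nn.relu)
        x = tf.layers.dropout(x, rate=0.5, training=training)
        x = tf.layers.dense(x, 4096, activation=tf.nn.relu)
        x = tf.layers.dropout(x, rate=0.5, training=training)
        return tf.layers.dense(x, num_classes)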

Model Training

In these experiments, it was decided to train the model from scratch using the CIFAR-10 dataset. The dataset was split into 50,000 images for training and validation and 10,000 images for testing.
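
Assuming the data-loading and model sketches above, training from scratch amounts to minimizing a softmax cross-entropy loss over mini-batches. The optimizer, learning rate, and loop below are illustrative placeholders rather than the exact values used in the experiments.

    images = tf.placeholder(tf.float32, [None, 32, 32, 3])
    labels = tf.placeholder(tf.int64, [None])
    logits = alexnet(images, training=True)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    batch_size, num_epochs = 128, 25
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(num_epochs):
            for start in range(0, len(train_x), batch_size):
                feed = {images: train_x[start:start + batch_size],
                        labels: train_y[start:start + batch_size]}
                sess.run(train_op, feed_dict=feed)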

Experimental Runs

The experiment was conducted in two steps.

Step 1

Training was done with different combinations of environment options; the batch size and the number of epochs were fixed in this step. The following table details the options chosen and the training accuracy for each run. In addition to these settings, the runs were made NUMA-aware using numactl.

All six runs shared KMP_BLOCKTIME=0, KMP_SETTINGS=1, KMP_AFFINITY=granularity=fine,verbose,compact,1,0, and inter_op=1; only OMP_NUM_THREADS and intra_op were varied.

Run | OMP_NUM_THREADS | intra_op | Epochs | Batch Size | Training Accuracy
1 | 6 | 6 | 100 | 2048 | 55.20%
2 | 8 | 8 | 100 | 2048 | 55.40%
3 | 10 | 10 | 100 | 2048 | 54.65%
4 | 12 | 12 | 100 | 2048 | 55.56%
5 | 16 | 16 | 100 | 2048 | 55.15%
6 | 24 | 24 | 100 | 2048 | 54.73%

Training accuracy differed only minimally across the runs. However, setting OMP_NUM_THREADS and intra_op to 12, 16, or 24 made training about 1.4 times faster (see the Configurations section) than setting them to 6, 8, or 10.
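
For reference, options like those in the table are typically applied by exporting the OMP/KMP variables before TensorFlow initializes its thread pools and by passing the inter_op/intra_op counts through the session configuration. A minimal sketch for run 4 (12 threads) might look as follows; it is not the experiment script itself.

    import os

    # KMP/OMP settings must be in the environment before TensorFlow starts OpenMP.
    os.environ["OMP_NUM_THREADS"] = "12"
    os.environ["KMP_BLOCKTIME"] = "0"
    os.environ["KMP_SETTINGS"] = "1"
    os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

    import tensorflow as tf

    config = tf.ConfigProto(inter_op_parallelism_threads=1,
                            intra_op_parallelism_threads=12)

    with tf.Session(config=config) as sess:
        pass  # build and run the training graph with this session configuration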

Step 2

The OMP_NUM_THREADS and intra_op parameters were set to 12, and runs were performed with different batch sizes. The top-1 training, validation, and inference accuracies were captured in each case. Runs were performed for 25, 100, and 1,000 epochs. The results are listed in the following table.

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 25 | 70.22% | 68.20% | 68.58%
96 | 25 | 67.67% | 65.74% | 65.43%
128 | 25 | 64.93% | 64.71% | 65.27%
256 | 25 | 58.56% | 57.34% | 57.70%

Figure 3. Training with 25 epochs.

It was observed that larger batch sizes degrade the quality of the model, because they reduce the stochasticity of the gradient descent. The drop in accuracy is steeper when the batch size increases from 128 to 256. In general, processors perform better when the batch size is a power of 2. Considering this, runs with a higher epoch count were performed with batch sizes of 64 and 128.

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 100 | 94.77% | 73.00% | 72.21%
128 | 100 | 89.97% | 71.76% | 71.80%

Figure 4. Training with 100 epochs.

As the epoch count increased, the network's accuracy improved. The validation and testing accuracies also matched closely, suggesting better generalization. Continuing the experiments with a larger number of epochs (1,000), the following top-1 and top-5 training and testing accuracies were observed.

Sr. No | Batch Size | Top-n | Training Accuracy | Testing Accuracy
1 | 128 | Top-5 | 100% | 96.89%
2 | 128 | Top-1 | 99.93% | 72.42%
3 | 2048 | Top-5 | 99.76% | 97.29%
4 | 2048 | Top-1 | 91.95% | 68.94%
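
Top-1 and top-5 accuracies such as those above are commonly computed with tf.nn.in_top_k. A short sketch, assuming logits and labels tensors like the ones defined earlier:

    # Fraction of samples whose true label is among the k highest-scoring logits.
    top1 = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=1), tf.float32))
    top5 = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=5), tf.float32))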

The following graphs show the top-1 and top-5 training accuracies, sampled every 100 epochs:

Figure 5. Training accuracy comparison (batch size: 128).

Figure 6. Training accuracy comparison (batch size: 2048).

Comparing the top-1 training and testing accuracies, it can be inferred that the network tends to overfit after about 500 epochs, likely because the model keeps training on the same data repeatedly.

Conclusion

The experiments on training the AlexNet topology with Intel-optimized TensorFlow on Intel® Xeon® Gold processor-powered machines, using the CIFAR-10 classification dataset, showed that optimally setting the environment parameters (especially OMP_NUM_THREADS and intra_op) yields better throughput and reduced training time. With the optimal environment configuration, choosing a smaller batch size helped improve the training accuracy.

Similar runs can be performed on future releases of Intel-optimized TensorFlow that support distributed mode, for further performance gains.

About the Author

Rajeswari Ponnuru, Ajit Kumar Pookalangara, Ravi Keron Nidamarty, and Rishabh Kumar Jain are Technical Consulting Engineers working with the Intel® AI Developer Program.

Acronyms and Abbreviations

Term/Acronym | Definition
CIFAR | Canadian Institute for Advanced Research
CIFAR-10 | Established computer-vision dataset used for object recognition

Configurations

For performance reference under Experimental Runs section: 

Hardware: Refer to Hardware under Environment Setup 
Software: Refer to Software under Environment Setup
With all other settings kept the same, the following two settings were changed across the runs:
Runtime settings 1: "OMP_NUM_THREADS" = "6";'intra_op' = 6
Runtime settings 2: "OMP_NUM_THREADS" = "8";'intra_op' = 8
Runtime settings 3: "OMP_NUM_THREADS" = "10";'intra_op' = 10
Runtime settings 4: "OMP_NUM_THREADS" = "12";'intra_op' = 12
Runtime settings 5: "OMP_NUM_THREADS" = "16";'intra_op' = 16
Runtime settings 6: "OMP_NUM_THREADS" = "24";'intra_op' = 24

Test performed: Executed the script ~/Tensorflow_Cifar10_Alexnet_Experiments/Tensorflow_Cifar10-V1-Alexnet_flags.py
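
The command-line flags accepted by the script are not listed here. As an illustration only, a NUMA-aware launch pinned to one socket could look like the following (the numactl binding values are assumptions):

    numactl --cpunodebind=0 --membind=0 python ~/Tensorflow_Cifar10_Alexnet_Experiments/Tensorflow_Cifar10-V1-Alexnet_flags.py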

For more information go to Product Performance.

References

  1. TensorFlow* Optimizations on Modern Intel® Architecture
  2. AlexNet topology diagram
  3. CIFAR-10 dataset

Related Resources

AlexNet details

About CIFAR-10 data

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.