CIFAR-10 Classification using Intel® Optimization for TensorFlow*
Published: 12/13/2017
Last Updated: 05/29/2018
Abstract
This work describes experiments conducted to train and test the AlexNet* deep learning topology with the Intel® Optimization for TensorFlow* library, using the CIFAR-10 classification dataset on machines powered by Intel® Xeon® Scalable processors. The experiments varied runtime options, and the training, validation, and testing accuracies were captured for each environment configuration to identify the optimal combination. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.
Document Content
Environment Setup
The following hardware and software environments were used to perform the experiments.
Hardware
Table 1. Intel® Xeon® Gold 6128 configuration
Architecture | x86_64 |
CPU op-mode(s) | 32-bit, 64-bit |
Byte order | Little Endian |
CPU(s) | 24 |
Core(s) per socket | 6 |
Socket(s) | 2 |
Thread(s) per core | 2 |
CPU family | 6 |
Model | 85 |
Model name | Intel® Xeon® Gold 6128 CPU @ 3.40 GHz |
RAM | 92 GB |
Software
Table 2. On Intel Xeon Gold processor
TensorFlow* | 1.4.0 (Intel optimized) |
Python* | 3.6.3 (Intel distributed) |
The Intel-optimized TensorFlow wheel was installed through pip:
pip install https://anaconda.org/intel/tensorflow/1.4.0/download/tensorflow-1.4.0-cp36-cp36m-linux_x86_64.whl
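A quick way to confirm that the wheel installed correctly is to import TensorFlow from the Intel-distributed Python and print the version. This is only a minimal sanity check; it does not by itself verify that the MKL-optimized kernels are active.

```python
# Minimal check that the Intel-optimized TensorFlow 1.4.0 wheel imports cleanly.
import tensorflow as tf

print(tf.__version__)  # expected output: 1.4.0
```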
Network Topology and Model Training
This section describes the dataset used, the AlexNet architecture, and how the model was trained in the current work.
Dataset
The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
The dataset was taken from Kaggle*3. The following figure shows a sample set of images for each classification.
Figure 1. CIFAR-10 sample images
For the experiments, 50,000 of the 60,000 images were used for training and 10,000 for testing.
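The experiments used the Kaggle download. Purely for illustration, the same 50,000/10,000 split can be obtained with the CIFAR-10 loader bundled with TensorFlow; this sketch is an assumption about tooling, not the exact data pipeline used in the article.

```python
import numpy as np
import tensorflow as tf

# CIFAR-10: 50,000 training images and 10,000 test images, each 32 x 32 x 3.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Scale pixels to [0, 1] and flatten the label arrays to shape (N,).
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0
y_train, y_test = y_train.flatten(), y_test.flatten()

print(x_train.shape, x_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
```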
AlexNet* Architecture
The AlexNet network consists of five convolution layers, max-pooling layers, dropout layers, and three fully connected layers. It was originally designed for classification with 1,000 possible categories.
Figure 2. AlexNet architecture (credit: MIT2).
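The article does not list the model-definition code. The following is a minimal AlexNet-style sketch scaled down to 32 x 32 CIFAR-10 inputs and 10 output classes, written against the TensorFlow 1.x layers API; the filter counts and dense-layer widths are illustrative assumptions, not the exact topology used in the experiments.

```python
import tensorflow as tf

def alexnet_cifar10(images, is_training, num_classes=10):
    """AlexNet-style network scaled to 32x32 inputs (layer sizes are illustrative)."""
    # Five convolution layers with interleaved max pooling.
    net = tf.layers.conv2d(images, 64, 3, padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, 2, 2)    # 32x32 -> 16x16
    net = tf.layers.conv2d(net, 192, 3, padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, 2, 2)    # 16x16 -> 8x8
    net = tf.layers.conv2d(net, 384, 3, padding='same', activation=tf.nn.relu)
    net = tf.layers.conv2d(net, 256, 3, padding='same', activation=tf.nn.relu)
    net = tf.layers.conv2d(net, 256, 3, padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, 2, 2)    # 8x8 -> 4x4

    # Three fully connected layers with dropout between them.
    net = tf.reshape(net, [-1, 4 * 4 * 256])
    net = tf.layers.dense(net, 1024, activation=tf.nn.relu)
    net = tf.layers.dropout(net, rate=0.5, training=is_training)
    net = tf.layers.dense(net, 1024, activation=tf.nn.relu)
    net = tf.layers.dropout(net, rate=0.5, training=is_training)
    return tf.layers.dense(net, num_classes)    # unscaled logits
```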
Model Training
In these experiments, the model was trained from scratch on the CIFAR-10 dataset. The dataset was split into 50,000 images for training and validation and 10,000 images for testing.
Experimental Runs
The experiment was conducted in two steps.
Step 1
Training was performed with different combinations of environment options; the batch size and number of epochs were fixed in this step. The following table details the options chosen and the results for each run. In addition to these settings, the runs were made NUMA-aware with numactl.
Run | Environment Options | Epochs | Batch Size | Training Accuracy |
---|---|---|---|---|
1 | "OMP_NUM_THREADS" = "6", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 6 | 100 | 2048 | 55.20% |
2 | "OMP_NUM_THREADS" = "8", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 8 | 100 | 2048 | 55.40% |
3 | "OMP_NUM_THREADS" = "10", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 10 | 100 | 2048 | 54.65% |
4 | "OMP_NUM_THREADS" = "12", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 12 | 100 | 2048 | 55.56% |
5 | "OMP_NUM_THREADS" = "16", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 16 | 100 | 2048 | 55.15% |
6 | "OMP_NUM_THREADS" = "24", "KMP_BLOCKTIME" = "0", "KMP_SETTINGS" = "1", "KMP_AFFINITY" = "granularity=fine, verbose, compact, 1, 0", 'inter_op' = 1, 'intra_op' = 24 | 100 | 2048 | 54.73% |
The training accuracy across all runs showed minimal difference. However, setting the OMP_NUM_THREADS and intra_op parameters to 12, 16, or 24 made training about 1.4 times faster (refer to Configurations) than setting them to 6, 8, or 10.
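The options in the table are applied partly through environment variables and partly through the TensorFlow session configuration. A sketch for the best-performing setting (OMP_NUM_THREADS and intra_op set to 12) is shown below; the placement of these settings follows common Intel TensorFlow guidance and is an assumption about the experiment scripts rather than a copy of them.

```python
import os
import tensorflow as tf

# OpenMP/MKL threading controls; set before the first TensorFlow op executes.
os.environ["OMP_NUM_THREADS"] = "12"
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# TensorFlow-level parallelism: one pool for independent ops,
# twelve threads for parallelism inside individual ops.
config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=12)
# Pass `config` to tf.Session() when training; the script itself can
# additionally be launched under numactl for NUMA pinning.
```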
Step 2
The OMP_NUM_THREADS and intra_op parameters were set to 12, and runs were performed for different batch sizes. The top-1 training, validation, and inference accuracies were captured in each case. Runs were performed for 25, 100, and 1,000 epochs. The results are listed in the following table.
Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
---|---|---|---|---|
64 | 25 | 70.22% | 68.20% | 68.58% |
96 | 25 | 67.67% | 65.74% | 65.43% |
128 | 25 | 64.93% | 64.71% | 65.27% |
256 | 25 | 58.56% | 57.34% | 57.70% |
Figure 3. Training with 25 epochs.
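For reference, a minimal mini-batch training loop covering the batch sizes and epoch counts swept in this step might look like the following. It reuses the data, model, and threading sketches above, and the optimizer choice is an illustrative assumption; this is not the article's Tensorflow_Cifar10-V1-Alexnet_flags.py script.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 64   # values swept in step 2: 64, 96, 128, 256
EPOCHS = 25       # later runs used 100 and more

images = tf.placeholder(tf.float32, [None, 32, 32, 3])
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)

logits = alexnet_cifar10(images, is_training)        # model sketch above
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)  # optimizer is an assumption

with tf.Session(config=config) as sess:              # config from the threading sketch
    sess.run(tf.global_variables_initializer())
    for epoch in range(EPOCHS):
        order = np.random.permutation(len(x_train))  # reshuffle each epoch
        for start in range(0, len(x_train), BATCH_SIZE):
            batch = order[start:start + BATCH_SIZE]
            sess.run(train_op, feed_dict={images: x_train[batch],
                                          labels: y_train[batch],
                                          is_training: True})
```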
Using a larger batch size degrades the quality of the model because it reduces the stochasticity of the gradient descent. The fall in accuracy is steeper when the batch size increases from 128 to 256. In general, processors perform better when the batch size is a power of 2. Considering this, the runs with a higher epoch count were performed with batch sizes of 64 and 128.
Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy |
---|---|---|---|---|
64 | 100 | 94.77% | 73.00% | 72.21% |
128 | 100 | 89.97% | 71.76% | 71.80% |
Figure 4. Training with 100 epochs.
As the epoch count increased, the network's accuracy improved. The validation and testing accuracies also matched closely, suggesting that the model generalizes better. Continuing the experiments with a larger number of epochs, the top-1 and top-5 training and testing accuracies were observed as follows.
Sr. No | Batch Size | Top-n Accuracy | Training Accuracy | Testing Accuracy |
---|---|---|---|---|
1 | 128 | Top-5 | 100% | 96.89% |
2 | 128 | Top-1 | 99.93% | 72.42% |
3 | 2048 | Top-5 | 99.76% | 97.29% |
4 | 2048 | Top-1 | 91.95% | 68.94% |
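The top-1 and top-5 numbers in the table can be computed from the logits with tf.nn.in_top_k. A brief sketch, meant to run inside the training session from the previous sketch and reusing its placeholders:

```python
# Per-example correctness: True if the true label is among the k highest logits.
top1 = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=1), tf.float32))
top5 = tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, labels, k=5), tf.float32))

# Evaluate on the 10,000 held-out test images with dropout disabled.
t1, t5 = sess.run([top1, top5],
                  feed_dict={images: x_test, labels: y_test, is_training: False})
print("top-1: %.2f%%  top-5: %.2f%%" % (t1 * 100, t5 * 100))
```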
The following graphs show the top-1 and top-5 training accuracies captured every 100 epochs:
Figure 5. Training accuracy comparison (batch size: 128).
Figure 6. Training accuracy comparison (batch size: 2048).
Comparing the top-1 training and testing accuracies, it can be inferred that the network tends to overfit after 500 epochs, likely because the model is repeatedly training on the same data.
Conclusion
Experiments training the AlexNet topology with Intel-optimized TensorFlow on the CIFAR-10 classification dataset, on machines powered by Intel® Xeon® Gold processors, showed that optimally setting the environment parameters (especially OMP_NUM_THREADS and intra_op) yields better throughput and shorter training times. With the optimal environment configuration, choosing a smaller batch size helped improve the training accuracy.
Similar runs can be performed on future releases of Intel-optimized TensorFlow that support distributed mode, to obtain further performance gains.
About the Author
Rajeswari Ponnuru, Ajit Kumar Pookalangara, Ravi Keron Nidamarty, and Rishabh Kumar Jain are Technical Consulting Engineers working with the Intel® AI Developer Program.
Acronyms and Abbreviations
Term/Acronym | Definition |
---|---|
CIFAR | Canadian Institute for Advanced Research |
CIFAR-10 | Established computer-vision dataset used for object recognition |
Configurations
For the performance reference under the Experimental Runs section:
Hardware: Refer to Hardware under Environment Setup
Software: Refer to Software under Environment Setup
While leaving the other settings the same, the following two settings were changed when repeating the runs:
Runtime settings 1: "OMP_NUM_THREADS" = "6";'intra_op' = 6
Runtime settings 2: "OMP_NUM_THREADS" = "8";'intra_op' = 8
Runtime settings 3: "OMP_NUM_THREADS" = "10";'intra_op' = 10
Runtime settings 4: "OMP_NUM_THREADS" = "12";'intra_op' = 12
Runtime settings 5: "OMP_NUM_THREADS" = "16";'intra_op' = 16
Runtime settings 6: "OMP_NUM_THREADS" = "24";'intra_op' = 24
Test performed: Executed the script ~/Tensorflow_Cifar10_Alexnet_Experiments/Tensorflow_Cifar10-V1-Alexnet_flags.py
For more information, go to Product Performance.
References
- TensorFlow* Optimizations on Modern Intel® Architecture
- AlexNet topology diagram
- CIFAR-10 dataset (Kaggle*)
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.