Accelerate Deep Learning Applications Using Multiprocessing and Intel® oneAPI Math Kernel Library (oneMKL) for Deep Neural Networks

Published: 10/28/2019  

Last Updated: 10/28/2019

Overview

Deep learning is transforming the way the world processes information. Advances in technology are allowing data to be collected at a continually increasing rate, and there is a need to quickly process large datasets to gain meaningful insights. Deep neural networks (DNNs) have a wide range of applications, offering the ability to extract key information from images, videos, audio, categorical data, textual data, and more. Because DNNs may need to process very large volumes of data, doing this efficiently is critical.

The DarwinAI* Generative Synthesis platform uses artificial intelligence (AI) to generate compact, highly efficient neural network models from existing model definitions. The produced models are lighter and less computationally expensive while maintaining accuracy that approaches the original. This results in faster inference times and the ability to target various hardware specifications. The DarwinAI Generative Synthesis platform supports the generation of a wide variety of neural network topologies that span many different applications.

Optimization Approach

The Intel® Xeon® Platinum 8153 processor is a high-performance, 16-core server processor; the dual-socket system used in this work provides 32 physical cores. Workloads running on it can be tuned using OpenMP*, an open, high-performance threading API, and Intel® oneAPI Math Kernel Library (oneMKL) for Deep Neural Networks (oneMKL-DNN). oneMKL-DNN provides highly optimized neural network operations and is integrated into the Anaconda* tensorflow-mkl package maintained by Intel.

Testing was done on the Intel® AI Builders DevCloud on a single dual-socket node with Intel Xeon Platinum 8153 processors at 2.00 GHz and 384 GB of memory, running CentOS Linux* release 7.4.1708 (Core).

For this workload's optimization, several environment variables were considered. OpenMP's OMP_NUM_THREADS, KMP_AFFINITY, and KMP_BLOCKTIME, as well as TensorFlow*'s inter_op_parallelism_threads and intra_op_parallelism_threads, were adjusted to yield the best performance.
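The following minimal sketch (not the DarwinAI script itself) illustrates how these settings interact: the OpenMP and oneMKL-DNN variables must be exported before TensorFlow initializes its thread pools, and the inter-op/intra-op settings are passed through the session configuration. The specific values shown are placeholders.

import os

# OpenMP / oneMKL-DNN tuning variables; these are read when the OpenMP runtime
# starts, so they must be set before TensorFlow is imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "1"

import tensorflow as tf  # TensorFlow 1.13 API, as used in this article

# TensorFlow threading knobs: inter_op controls parallelism across independent
# operations, intra_op controls parallelism within a single operation.
config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)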

An Anaconda* environment was created using the Intel channel and the tensorflow-mkl 1.13.1 package. For each topology, two pretrained models were provided in checkpoint format. Models were loaded into TensorFlow using an inference script provided by DarwinAI. The original script was also modified for testing to include flags for multiprocessing, batch size, number of inferences per process, and the OpenMP and oneMKL-DNN environment variables.
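A hypothetical command-line interface for such a modified script might look like the sketch below; the flag names are illustrative assumptions, not the actual options of the DarwinAI script.

import argparse

# Illustrative flags only; the real script's option names may differ.
parser = argparse.ArgumentParser(description="Multi-process inference benchmark")
parser.add_argument("--num-processes", type=int, default=1,
                    help="number of worker processes to launch")
parser.add_argument("--batch-size", type=int, default=32,
                    help="inference batch size per step")
parser.add_argument("--iterations", type=int, default=50,
                    help="number of inference iterations per process")
parser.add_argument("--omp-num-threads", type=int, default=1,
                    help="value exported as OMP_NUM_THREADS in each worker")
parser.add_argument("--kmp-blocktime", type=int, default=1,
                    help="value exported as KMP_BLOCKTIME in each worker")
parser.add_argument("--inter-op", type=int, default=1,
                    help="inter_op_parallelism_threads for the TensorFlow session")
parser.add_argument("--intra-op", type=int, default=1,
                    help="intra_op_parallelism_threads for the TensorFlow session")
args = parser.parse_args()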

The total inference time was calculated as the time taken to run the specified number of inference iterations in a process. Each process reported its frames per second (FPS) to a shared variable, and the FPS output of each run was recorded to a .csv file for analysis.
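The sketch below shows one way the per-process measurement could be structured, assuming a simple timing loop; run_batch, the shared results list, and the output file name are placeholders rather than names taken from the original script.

import csv
import time

def run_inference(run_batch, iterations, batch_size, results):
    # Time a fixed number of inference iterations and report FPS.
    # results is a shared list (e.g., multiprocessing.Manager().list()) that the
    # parent process later writes to a .csv file.
    start = time.time()
    for _ in range(iterations):
        run_batch()                       # one sess.run(...) over a batch
    elapsed = time.time() - start
    results.append((iterations * batch_size) / elapsed)

def write_results(results, path="results.csv"):
    # One FPS value per process, collected after all workers have finished.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["process_fps"])
        for fps in results:
            writer.writerow([fps])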

The base runs for each model used a single process, with no environment variables or TensorFlow configurations set.

Subsequent runs were done first using multiprocessing to distribute work among several processes, and then additionally with added OpenMP and oneMKL-DNN configurations.
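A minimal multiprocessing sketch of this pass is shown below. It assumes each worker builds its own graph, restores the checkpoint, and reports its FPS through a managed shared list; it approximates, but does not reproduce, the DarwinAI test harness.

import multiprocessing as mp

def worker(proc_id, iterations, batch_size, results):
    # In the real script this would build the TensorFlow graph, restore the
    # pretrained checkpoint, and call run_inference() as sketched earlier.
    results.append(0.0)  # placeholder FPS value

if __name__ == "__main__":
    num_processes = 30
    with mp.Manager() as manager:
        results = manager.list()
        procs = [mp.Process(target=worker, args=(i, 50, 32, results))
                 for i in range(num_processes)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("Aggregate FPS:", sum(results))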

Environment variable configurations were tested after the multiprocessing pass, looping over candidate settings with small batch sizes and iteration counts to get an initial estimate of performance in each case. Afterwards, more targeted looping was done with larger batch sizes and iteration counts to obtain a more stable FPS measurement.
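Because OMP_NUM_THREADS and the KMP_* variables are read when OpenMP and TensorFlow initialize, each configuration is best evaluated in a fresh process. The loop below is an illustrative sweep written under that assumption; the script name and flag names are hypothetical and match the sketch shown earlier.

import itertools
import subprocess
import sys

# First pass: small batch size and few iterations to rank candidate settings.
for procs, omp, inter_op, intra_op in itertools.product(
        [1, 8, 16, 30], [1, 2], [1, 2], [1, 2]):
    subprocess.run(
        [sys.executable, "inference.py",        # hypothetical script name
         "--num-processes", str(procs),
         "--omp-num-threads", str(omp),
         "--inter-op", str(inter_op),
         "--intra-op", str(intra_op),
         "--batch-size", "8",
         "--iterations", "10"],
        check=True)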

Topologies

DarwinAI provided model definitions of three topologies for optimization analysis. One is an audio processing model (EdgeSpeech) that detects sounds within an audio stream, and the other two models (NASNet and ResNet50) are vision-processing networks that can be used for object recognition in images or videos.

The EdgeSpeech model is an audio-inferencing network that detects speech from audio. It is meant to run on edge devices because of its compact size and low compute requirements. The model comprises residual blocks that maintain a “memory” of the samples it has processed over time; these memories help EdgeSpeech decide whether or not it has heard a particular class.

The NASNet model is a state-of-the-art image classification network generated by the Google* AutoML neural architecture search system. This tool, which was developed to automate much of the work of neural network design, can produce models tailored to specific applications.

The ResNet50 model is a convolutional neural network with residual "skip" connections that allow much deeper networks to be trained accurately. The ResNet architecture won the ImageNet Large Scale Visual Recognition Challenge in 2015 with a 3.57 percent top-5 error on the ImageNet dataset.
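As a minimal illustration of a residual skip connection (this is not the DarwinAI or reference ResNet50 definition), the block below adds its input back to its convolutional output, which is what allows very deep networks to train accurately.

import tensorflow as tf  # TensorFlow 1.x layers API

def residual_block(x, filters):
    # Assumes x already has `filters` channels so the addition is valid.
    shortcut = x
    y = tf.layers.conv2d(x, filters, 3, padding="same", activation=tf.nn.relu)
    y = tf.layers.conv2d(y, filters, 3, padding="same")
    return tf.nn.relu(y + shortcut)  # skip connection: add input to output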

For each model, both an original definition and an optimized version were provided by DarwinAI for testing.

Software Configurations

We created an environment using the Anaconda data science platform for Python*, and then installed the Intel® Distribution for Python* 3.6.3 and TensorFlow MKL 1.13.1. Follow the installation guides listed in the References section to install the required software.

Hardware Configuration

Table 1 lists the hardware configurations used in the tests.

Table 1. Hardware configurations

 | Intel® AI Builders DevCloud (Intel® Xeon® Platinum 8153 processor)
Platform | Intel® Server Board S2600BPB
Number of nodes | 1
Number of sockets | 2
CPU | Intel Xeon Platinum 8153 processor at 2.00 GHz
Cores/socket, threads/socket | 16/32
Ucode | 0x200004d
Hyperthreading (HT) | On
Turbo | On
BIOS version | SE5C620.86B.00.01.0015.110720180833
System DDR memory configuration (slots / cap / run-speed) | 12 slots / 32 GB / 2666 MHz DDR4 DIMM
Total memory per node (DDR+DCPMM) | 384 GB
Network interface controller (NIC) | Intel® Ethernet Controller 10 Gigabit X550T
Platform controller hub (PCH) | Intel® C621 chipset
OS | Ubuntu* 18.04.2 LTS
Kernel | Linux* 4.15.0-46-generic
Mitigation variants (1, 2, 3, 3a, 4, L1TF) | Full mitigation

DarwinAI EdgeSpeech Performance

Table 2 lists the baseline and optimized software configurations (and versions) used in the test. 

Table 2. Software configurations for the DarwinAI EdgeSpeech test

 | Baseline Configuration | Optimized Configuration
Workload and version | EdgeSpeech (original) | EdgeSpeech (DarwinAI)
Library version | oneMKL-DNN 2019.3 | oneMKL-DNN 2019.3
Framework version | TensorFlow 1.13 (conda tensorflow-mkl) | TensorFlow 1.13 (conda tensorflow-mkl)
Python version | Python 3.6.3 | Python 3.6.3
Dataset | Custom 10 Class Audio | Custom 10 Class Audio
Topology | EdgeSpeech (original) | EdgeSpeech (DarwinAI)
Batch size | n/a | n/a
Streams | 1 | 1
Additional run-time command line parameters (inter_op, intra_op, OMP settings) | Default TensorFlow settings | OMP_NUM_THREADS=1, KMP_BLOCKTIME=1, inter_op_parallelism_threads=1, intra_op_parallelism_threads=1

The Evaluation

We compared the performance between the original inference script and a multi-inference version of the script that was tuned with oneMKL-DNN parameters (see Table 2).

The best performing multiprocess configuration was used to fine-tune the workload using environment parameters.

Experiment Results

Table 3 shows unoptimized runs for both the original and DarwinAI-generated models, as well as optimized runs for each.

Table 3. Intel® Optimization for TensorFlow* classification inference; SPS on Intel Xeon Platinum 8153 processor1

 | Processes | OMP | inter_op | intra_op | KMP_BLOCKTIME | Batch size | Iterations | FPS | CPU (%) | Vect. (%) | AVX-512 (%)
Original | 1 | unset | unset | unset | unset | n/a | 50 | 77.52 | 87.7 | 95.5 | 99.1
Original (oneMKL-DNN) | 30 | 1 | 1 | 1 | 1 | n/a | 50 | 568.06 | 96 | 97.9 | 99.1
DarwinAI | 1 | unset | unset | unset | unset | n/a | 50 | 141.96 | 75.1 | 86.9 | 97.4
DarwinAI (oneMKL-DNN) | 30 | 1 | 1 | 1 | 1 | n/a | 50 | 1080.43 | 93.8 | 91.4 | 97.4

 

We found the best configuration to be as follows:

Processes 30
OMP_NUM_THREADS 1
INTER_OP_PARALLELISM_THREADS 1
INTRA_OP_PARALLELISM_THREADS 1
KMP_BLOCKTIME 1
KMP_AFFINITY ‘granularity=fine,compact,1,0’

Figure 1. Intel Xeon Platinum 8153 processor: DarwinAI EdgeSpeech normalized performance (FPS)

Between the original and the DarwinAI optimized models there was a 1.83x increase in performance (77.52 samples per second (SPS) versus 141.96 SPS). Optimization done by Intel, including oneMKL-DNN tuning and multiprocessing, achieved another 7.6x increase in performance at 1080.43 SPS. As Figure 1 shows, the total performance increase from the original to the DarwinAI EdgeSpeech with multi-inference and oneMKL-DNN tuning was 13.9x.

DarwinAI NASNet Performance

Table 4 lists the software configurations used in the test.

Table 4. Software configuration for the DarwinAI NASNet test

 | Baseline Configuration | Optimized Configuration
Workload and version | NASNet (original) | NASNet (DarwinAI)
Library version | oneMKL-DNN 2019.3 | oneMKL-DNN 2019.3
Framework version | TensorFlow 1.13 (conda tensorflow-mkl) | TensorFlow 1.13 (conda tensorflow-mkl)
Python version | Python 3.6.3 | Python 3.6.3
Dataset | Cifar10 | Cifar10
Topology | NASNet (original) | NASNet (DarwinAI)
Batch size | 32 | 32
Streams | 1 | 30
Additional run-time command line parameters (inter_op, intra_op, OMP settings) | Default TensorFlow settings | OMP_NUM_THREADS=1, KMP_BLOCKTIME=1, inter_op_parallelism_threads=2, intra_op_parallelism_threads=1

The Evaluation

We compared the performance between the original inference script and a multi-inference version of the script, which was tuned with oneMKL-DNN parameters (see Table 4).

The best performing multiprocess configuration was used to fine-tune the workload using environment parameters.

Experiment Results

Table 5 shows unoptimized runs for both the original and DarwinAI-generated models, as well as optimized runs for each.

Table 5. Intel Optimization for TensorFlow classification inference; SPS on Intel Xeon Platinum 8153 processor1

 | Procs | OMP | inter_op | intra_op | Blocktime | Batch size | Iters | FPS | CPU (%) | Vect. (%) | AVX-512 (%) | AVX-128 (%) | Normalized | Latency (s)
Original (TensorFlow 1.13.1-mkl) | 1 | unset | unset | unset | unset | 32 | 30 | 20.19587 | 58.8 | 75.2 | 60.3 | 39.7 | – | 0.049515
Original (30 procs + oneMKL-DNN tuning) | 30 | 1 | 2 | 1 | 1 | 32 | 30 | 147.4264 | 91.8 | 98.3 | 60.6 | 39.4 | – | 0.006783
DarwinAI | 1 | unset | unset | unset | unset | 32 | 30 | 32.25049 | 59 | 73.3 | 61.5 | 38.5 | 1 | 0.031007
DarwinAI (30 procs + oneMKL-DNN tuning) | 30 | 1 | 2 | 1 | 1 | 32 | 30 | 311.42 | 91.7 | 98.3 | 60.3 | 39.7 | 9.656288 | 0.003211

Figure 2. Intel Xeon Platinum 8153 processor: DarwinAI NASNet normalized performance (FPS)

There was a performance improvement of 1.59x between the standard unoptimized model (20.19 FPS) and the DarwinAI optimized model (32.25 FPS). Figure 2 shows a further 9.6x increase in performance when the DarwinAI model was run with multiprocessing and oneMKL-DNN configurations (311.42 FPS), for a roughly 15.4x improvement over the standard unoptimized model overall.

DarwinAI ResNet50 Performance

Table 6 lists the software configurations used in this test.

Table 6. Software configurations for the DarwinAI ResNet50 test

 | Baseline Configuration | Optimized Configuration
Workload and version | ResNet50 (original) | ResNet50 (DarwinAI)
Library version | oneMKL-DNN 2019.3 | oneMKL-DNN 2019.3
Framework version | TensorFlow 1.13 (conda tensorflow-mkl) | TensorFlow 1.13 (conda tensorflow-mkl)
Python version | Python 3.6.3 | Python 3.6.3
Dataset | ImageNet | ImageNet
Topology | ResNet50 (original) | ResNet50 (DarwinAI)
Batch size | 32 | 32
Streams | 1 | 30
Additional run-time command line parameters (inter_op, intra_op, OMP settings) | Default TensorFlow settings | OMP_NUM_THREADS=2, KMP_BLOCKTIME=1, inter_op_parallelism_threads=1, intra_op_parallelism_threads=1

The Evaluation

We compared the performance between the original inference script and a multi-inference version of the script, which was tuned with oneMKL-DNN parameters (see Table 6). 

The best performing multiprocess configuration was used to fine-tune the workload using environment parameters.

Experiment Results

Table 7 shows unoptimized runs for both the original and DarwinAI generated models, as well as optimized runs for each.

 

Table 7. Intel Optimization for TensorFlow classification inference SPS; Intel Xeon Platinum 8153 processor1


We found the best configuration to be as follows: 

Processes 32
OMP_NUM_THREADS 2
INTER_OP_PARALLELISM_THREADS 1
INTRA_OP_PARALLELISM_THREADS 1
KMP_BLOCKTIME 1
KMP_AFFINITY ‘granularity=fine,compact,1,0’
Batch Size 32

Figure 3. Intel Xeon Platinum 8153 processor: DarwinAI ResNet50 normalized performance (FPS)

There was a performance improvement of 1.07x between the standard unoptimized model (21.39 FPS) and the DarwinAI optimized model (23.035 FPS). Figure 3 shows a further 16.3x performance increase when the DarwinAI model was run with multiprocessing and oneMKL-DNN configurations (377.43 FPS), for a roughly 17.6x improvement over the standard unoptimized model overall.

Conclusion

DarwinAI Generative Synthesis technology offers the ability to produce neural networks that are smaller and less computationally expensive while maintaining their functional fidelity. The models generated by DarwinAI not only showed a dramatic performance increase over the original models but also exhibited an increased response to tuning on Intel hardware. Coupled with Intel Xeon Platinum processors and Intel® software, these models demonstrate the ability to significantly outperform their baseline counterparts.

Intel provides many AI-enabling technologies, including Intel Optimization for TensorFlow and oneMKL-DNN, which accelerate the performance of neural network applications.

References

DarwinAI*

EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge

Anaconda | The World's Most Popular Data Science Platform

Installing Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda*

Intel® Optimization for TensorFlow* Installation Guide

oneMKL

oneMKL-DNN on GitHub*

1Performance results are based on testing as of October 31, 2019, and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Platform: S2600BPB, # Nodes: 1, # Sockets: 2, CPU: Intel Xeon Platinum 8256 CPU @ 3.80GHz, Cores/Socket, Threads/socket: 4/8, ucode: 0x500001c, HT: On, Turbo: On, BIOS Version: SE5C620.86B.02.01.0008.031920191559, System DDR Memory Configuration: slots/cap/run-speed:12 slots / 32 GB / 2666 MTs / DDR4 DIMM, Total Memory/Node (DDR+DCPMM): 376 GB, Storage (boot): Intel SSD SC2KB48 480GB (1GB boot partition), Storage (application drives): Intel SSD SC2KB48 480GB (444GB application partition), NIC:  Intel Ethernet Controller 10G X550T, PCH: Intel C621 chipset, OS: Ubuntu* 18.04.2 LTS, Kernel: 4.15.0-48-generic.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.