TensorFlow* is one of the leading deep learning and machine learning frameworks today. Earlier in 2017, Intel worked with Google to incorporate optimizations for Intel® Xeon® and Xeon Phi™ processor based platforms using the Intel® Math Kernel Library (Intel® MKL). These optimizations improved performance by orders of magnitude: up to 70x higher performance for training and up to 85x higher performance for inference.
In this blog we provide a performance update for a number of deep learning models running on the Intel Xeon Scalable processor. The Intel Xeon Scalable processor provides up to 28 cores, which brings additional computing power compared to the 22 cores of its predecessor. Additional improvements include a non-inclusive last-level cache, a larger 1 MB L2 cache, faster 2666 MHz DDR4 memory, and an increase to six memory channels per CPU. In addition, the Intel Xeon Scalable processor includes Intel® Advanced Vector Extensions 512 (Intel® AVX-512), originally introduced with the Intel® Xeon Phi™ processor product line. The Intel Xeon Scalable processor introduces new Intel AVX-512 CPUID flags (AVX512BW and AVX512DQ) as well as a new capability (AVX512VL) to expand the benefits of the technology. The AVX512DQ CPUID flag covers new instructions that benefit high-performance computing (HPC) and machine learning workloads.
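On Linux, one quick way to see which of these AVX-512 feature flags a machine reports is to read the `flags` line of `/proc/cpuinfo`. The sketch below is a minimal illustration and not part of the benchmark setup. The helper name `avx512_flags` is hypothetical, and the code assumes the kernel's lowercase flag spelling (for example `avx512bw`, `avx512dq`, `avx512vl`).

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        # Each logical CPU has a line such as "flags : fpu vme ... avx512f ..."
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []  # no flags line (for example, a non-x86 or non-Linux system)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(avx512_flags(handle.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

An empty list means the processor (or the kernel) does not report any AVX-512 flag, in which case the AVX-512 code paths discussed here would not apply.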
The optimizations discussed in this article utilize the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). This is an open source performance library for deep learning applications, intended for acceleration of DL frameworks on Intel® architecture. Intel MKL-DNN includes highly vectorized and threaded building blocks for implementation of convolutional neural networks with C and C++ interfaces. Note that TensorFlow currently supports the open-sourced Intel MKL-DNN as well as the DNN primitives in the closed source Intel Math Kernel Library. The version to use is selected when building TensorFlow. It is expected that support for the closed source DNN primitives will be removed from TensorFlow in the future.
Optimizing deep learning model performance on the Intel Xeon Scalable processor relies on several techniques that are also used for performance-sensitive applications in High Performance Computing (HPC):
- Code must be refactored to take advantage of Intel AVX-512 instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization, are vectorized using the latest Intel AVX-512 instructions.
- Maximum performance requires paying special attention to using all the available cores efficiently. This means looking at parallelization within a given layer or operation as well as parallelization across layers.
- As much as possible, data has to be available when the execution units need it. This means balanced use of prefetching, cache blocking techniques and data formats that promote spatial and temporal locality.
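To make the cache-blocking idea in the last item concrete, here is a minimal pure-Python sketch of a blocked (tiled) matrix multiplication. It only illustrates the loop structure. It is not how Intel MKL-DNN implements its primitives, and the function name `blocked_matmul` and the `block` parameter are hypothetical. A real kernel would pick tile sizes that fit the cache hierarchy and vectorize the innermost loop.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x k) and b (k x m) tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Walk the output and the reduction dimension in block-sized tiles so
    # that each tile of a, b and c is reused while it is still in cache.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        # Innermost loop reads b and writes c contiguously.
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result is the same for any positive `block` value; only the order in which elements are visited changes, which is what affects spatial and temporal locality.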
Intel® MKL-DNN provides a number of deep learning primitives that are highly optimized for Intel Xeon Scalable processors using the techniques described above. Using these primitives inside various deep learning frameworks helps ensure that common building blocks are implemented efficiently. The primitives include:
- 2D convolution
- Inner product/Matrix multiplication
- Pooling: maximum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU)
- Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
In TensorFlow, we implemented optimized versions of TensorFlow operations to make sure that these operations can utilize optimized MKL-DNN primitives for Intel Xeon Scalable CPUs wherever possible. While this is a necessary step to enable scalable performance on Intel® architecture, to get the best performance we implemented several additional optimizations including the following:
- Layout optimizations: For performance reasons, Intel MKL uses a different layout than the default layout in TensorFlow. This can require frequent conversions between the MKL layout and the TensorFlow layout, so several optimizations were used to keep the overhead of conversion between the two formats to a minimum.
- Replace default TensorFlow operations with Intel optimized versions when running on CPUs. This ensures that users can run their existing Python programs and realize the performance gains without changes to their neural network model.
- Fuse multiple operations together to enable efficient cache reuse on CPU.
- Propagate intermediate states between forward and backward passes to improve back-propagation performance.
- Use a custom CPU pool allocator that helps avoid costly page misses and page clears.
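The layout conversions mentioned in the first item are, at their core, index permutations over the same data. As a rough illustration only, the sketch below converts a tensor stored as nested lists from NHWC order (batch, height, width, channels) to NCHW order (batch, channels, height, width). Intel MKL's internal layouts are not described in this article and may differ from plain NCHW; the helper name `nhwc_to_nchw` is hypothetical.

```python
def nhwc_to_nchw(tensor):
    """Convert a nested-list tensor from NHWC order to NCHW order."""
    n = len(tensor)
    h = len(tensor[0])
    w = len(tensor[0][0])
    c = len(tensor[0][0][0])
    # out[b][ch][y][x] takes the value stored at tensor[b][y][x][ch].
    return [[[[tensor[b][y][x][ch] for x in range(w)]
              for y in range(h)]
             for ch in range(c)]
            for b in range(n)]
```

Every conversion of this kind touches each element of the tensor once, which is why avoiding unnecessary conversions between operations matters for performance.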
The following performance results were obtained for benchmark models from the TensorFlow repository at https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
To get maximum performance we tuned the following parameters specifically for the model and for the processor.
- Data format: Set to NCHW format to get maximum performance. (The TensorFlow default NHWC format is not the most efficient data layout for the CPU, and it results in some additional conversion overhead.)
- Inter-op / intra-op: This setting impacts parallelism within one layer as well as across layers. Typically inter-op is set to 2 and intra-op is set to the number of logical cores. The values are refined for the specific model and CPU.
- Batch size: Batch size is another important parameter. It impacts both the parallelism available to utilize all the cores and the working set size, and therefore memory performance in general.
- OMP_NUM_THREADS: Maximum performance requires using all the available cores efficiently. This setting is typically set to the same value as intra-op, but it also needs to be tuned for best performance.
- Transpose in Matrix multiplication: for some matrix sizes, transposing the second input matrix b provides better performance (better cache reuse) in the MatMul layer. This is the case for all the MatMul operations used in the three models below. Users should experiment with this setting for other matrix sizes.
- KMP_BLOCKTIME: This is the time in milliseconds a thread should wait after completing the execution of a parallel region before going to sleep. Typically this is set to a small value of 1, but some models such as AlexNet need a higher setting.
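As a rough sketch of how these settings fit together, the snippet below builds the environment variables and a command line for `tf_cnn_benchmarks.py`. The numeric values are illustrative placeholders, not the tuned settings used for the results in this article, and the helper name `benchmark_command` is hypothetical. The flag names are assumed to match those accepted by the benchmark script; check them against the version of the script you are running.

```python
import os


def benchmark_command(model, batch_size, inter_op=2, intra_op=None,
                      kmp_blocktime=1):
    """Return (env, argv) for a CPU run of tf_cnn_benchmarks.py."""
    if intra_op is None:
        # Starting point from the list above: intra-op = logical core count.
        intra_op = os.cpu_count() or 1
    env = {
        "OMP_NUM_THREADS": str(intra_op),  # usually the same as intra-op
        "KMP_BLOCKTIME": str(kmp_blocktime),
    }
    argv = [
        "python", "tf_cnn_benchmarks.py",
        "--device=cpu",
        "--data_format=NCHW",
        "--model=" + model,
        "--batch_size=" + str(batch_size),
        "--num_inter_threads=" + str(inter_op),
        "--num_intra_threads=" + str(intra_op),
    ]
    return env, argv


if __name__ == "__main__":
    # Placeholder model name and batch size, chosen only for illustration.
    env, argv = benchmark_command("resnet50", batch_size=128)
    print(" ".join(k + "=" + v for k, v in sorted(env.items())),
          " ".join(argv))
```

Keeping the settings in one function makes it easier to sweep over batch sizes and thread counts when tuning for a particular model and processor.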
Settings on Intel® Xeon® Scalable processor (2 Sockets, 28 Cores each) that were used for benchmarking.
Please note: The parameter settings were carefully tuned to gain maximum performance for the specific platform.
Performance results for Training on Intel® Xeon® Scalable processor (2 Sockets – 28 Cores each), mock data.
Performance results for Inference on Intel® Xeon® Scalable processor (2 Sockets – 28 Cores each), mock data. Inference performance was measured by running forward pass only.
In conclusion, TensorFlow now supports the Intel Xeon Scalable platform through the Intel MKL-DNN open source library. No additional software or configuration is required other than building TensorFlow with specific Intel MKL build settings. We are continually improving the performance of the Intel® Optimization for TensorFlow* and will update the repository regularly.
Special thanks to Intel contributors Huma Bidi, Mahmoud Abuzaina, Md Faijul Amin, Mohammad Ashraf Bhuiyan, Jayaram Bobba, Xiaoming Cui, Sheng Fu, Niranjan Hasabnis, Jing Huang, Jennifer Myers, Elmoustapha Ould-ahmed-vall, Clayne Robison, Bhavani Subramanian, Lakshay Tokas, Wei Wang, Karen Wu, and Guozhong Zhuang.
Notices & Disclaimers
Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #201108
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© Intel Corporation. Intel, the Intel® logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.
 The results are reported at https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
 Same as (i)
 Refer to https://github.com/01org/mkl-dnn for more details on Intel® MKL-DNN optimized primitives
 For the complete list of optimizations, refer to https://github.com/pennsate/AIM2017/raw/master/AIM-accelerating.pdf
 System configuration: CPU: Intel Xeon Platinum 8180 processor @ 2.50GHz; OS CentOS 7.4; TensorFlow Source Code: https://github.com/tensorflow/tensorflow; TensorFlow Commit ID: 926fc13f7378d14fa7980963c4fe774e5922e336. Detailed configuration is as follows:
- CPU: Thread(s) per core: 2; Core(s) per socket: 28; Socket(s): 2; NUMA node(s): 2; CPU family: 6; Model: 85; Model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz; Stepping: 4; HyperThreading: ON; Turbo: ON
- Memory: 376GB (12 x 32GB), 24 slots, 12 occupied, 2666 MHz
- Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
- BIOS: SE5C620.86B.00.01.0004.071220170215
- OS: Centos Linux 7.4.1708 (Core)
- Kernel: 3.10.0-693.11.6.el7.x86_64