Medical imaging analysis. Natural language processing. Investigating science's most challenging questions. Organizations around the world are choosing Intel® architecture for the AI compute they need. 2nd Generation Intel® Xeon® Scalable processors, the only microprocessors with built-in AI inference acceleration, have the versatility to excel at workloads such as analytics, high performance computing, and business-critical databases that are adding AI capabilities. Intel architecture's enduring excellence for these workloads means it is well understood by developers and readily available in their infrastructure, which allows organizations to achieve faster time to value by running AI inference on their existing IT investment.
With Intel® Deep Learning Boost (Intel® DL Boost), our 2nd Gen Intel Xeon Scalable processors provide a better platform for AI than ever before, boosting throughput for inference applications by up to 14x¹ compared to the first generation of Intel Xeon Scalable processors. In my talk today at the AI Conference in New York, I'll dive deep into Intel DL Boost's Vector Neural Network Instructions (VNNI) and how they improve AI performance by combining three instructions into one, thereby maximizing the use of compute resources, improving cache utilization, and avoiding potential bandwidth bottlenecks. Built on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), VNNI speeds the delivery of inference results and, potentially, of critical insights. Please read on for an introduction to VNNI, and join me at the AI Conference if you'd like to learn more.
How Vector Neural Network Instructions Work
VNNI can be thought of as AI inference acceleration integrated into every 2nd Gen Intel Xeon Scalable processor. The benefits of these instructions are best demonstrated by comparing them to the similar instructions used in our previous generation of Intel Xeon Scalable processors, as shown below.
Most deep learning applications today use 32-bit floating point precision for their training and inference workloads. In the previous generation of Intel Xeon Scalable processors, the convolution operations that predominate in neural network workloads were implemented in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) on the FP32 data type via the vfmadd231ps instruction in the Intel® AVX-512 instruction set. Intel Xeon Scalable processors were the first Intel Xeon CPUs to include Intel AVX-512, with up to two 512-bit FMA units computing in parallel per core, enabling the execution of two vfmadd231ps instructions in a given cycle.
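To make this concrete, here is a minimal sketch of the FP32 multiply-accumulate pattern a convolution loop reduces to, written with AVX-512 intrinsics. This is illustrative, not Intel MKL-DNN code: the function name is hypothetical, and it assumes n is a multiple of 16.

```c
#include <immintrin.h>

/* FP32 dot-product inner loop using AVX-512 fused multiply-add.
 * _mm512_fmadd_ps compiles to a vfmadd*ps instruction such as
 * vfmadd231ps. Build with AVX-512 enabled (e.g. -mavx512f). */
float dot_fp32(const float *a, const float *b, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);  /* 16 FP32 lanes */
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);  /* acc += va * vb */
    }
    return _mm512_reduce_add_ps(acc);        /* horizontal sum */
}
```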
More recently, the Int8 data type has been used successfully for deep learning inference, with a significant boost to performance and little loss of accuracy. Int8 uses 8 bits to represent integer data: 7 value bits and a sign bit. FP32 uses 32 bits to represent floating point data: 23 bits of mantissa, 8 bits of exponent, and a sign bit. Using the smaller Int8 type for inference brings better memory and compute utilization, since less data is transferred and more values are processed per instruction. Previous-generation Intel Xeon Scalable processors implemented convolution operations in Intel MKL-DNN using the Intel AVX-512 instructions vpmaddubsw, vpmaddwd, and vpaddd to take advantage of this low-precision data. Although this improved performance compared to FP32 convolution, the need for three instructions per Int8 multiply-accumulate, combined with the microarchitectural limit of two 512-bit instructions per clock cycle, left room for further innovation.
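Expressed with the corresponding intrinsics, that three-instruction Int8 sequence looks roughly like the sketch below; the helper name and calling convention are illustrative, not taken from Intel MKL-DNN.

```c
#include <immintrin.h>

/* Pre-VNNI Int8 multiply-accumulate: the vpmaddubsw / vpmaddwd /
 * vpaddd sequence. a holds unsigned 8-bit activations, b holds
 * signed 8-bit weights, acc holds 32-bit partial sums. Build with
 * AVX-512BW enabled (e.g. -mavx512bw). */
static inline __m512i mac_int8_3insn(__m512i acc, __m512i a, __m512i b) {
    const __m512i ones = _mm512_set1_epi16(1);
    __m512i t = _mm512_maddubs_epi16(a, b); /* vpmaddubsw: u8*s8 pairs -> s16, saturating */
    t = _mm512_madd_epi16(t, ones);         /* vpmaddwd: s16 pairs -> s32 */
    return _mm512_add_epi32(acc, t);        /* vpaddd: accumulate */
}
```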
In 2nd Gen Intel Xeon Scalable processors with VNNI, convolutions in Intel® MKL-DNN occur in Int8 precision via a single vpdpbusd Intel AVX-512 instruction. Because the low-precision operation now needs only one instruction, two of these instructions can be executed in a given cycle. Together, reduced precision and a single fused instruction make fuller use of the microarchitecture for each convolution operation in a neural network and bring significant performance benefits.
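For comparison, here is the same multiply-accumulate as one VNNI operation; the helper name is again illustrative.

```c
#include <immintrin.h>

/* VNNI Int8 multiply-accumulate: vpdpbusd multiplies four adjacent
 * u8*s8 pairs per 32-bit lane and accumulates the sums in place,
 * fusing the three instructions above into one. Build with VNNI
 * enabled (e.g. -mavx512vnni). */
static inline __m512i mac_int8_vnni(__m512i acc, __m512i a, __m512i b) {
    return _mm512_dpbusd_epi32(acc, a, b);  /* acc += dot4(u8, s8) */
}
```

A side benefit: vpdpbusd accumulates directly into 32-bit lanes, avoiding the 16-bit saturation that can occur in the vpmaddubsw step of the older sequence.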
Neural network inference requires the weights from a trained model to perform forward propagation. These weights are typically stored in FP32 precision during training, because floating point data types such as FP32 help maintain accuracy and ensure convergence. To take advantage of low-precision inference, the FP32 weights from the trained model are converted to Int8 through a process called quantization. This conversion from a floating point data type to an integer data type may cause some loss in accuracy. So, how can we gain the benefits of the Int8 data type in inference without sacrificing accuracy?
After training, we collect statistics on the activations in order to find an appropriate quantization factor, then use that factor to perform post-training quantization for 8-bit inference. In addition, a technique called quantization-aware training employs "fake" quantization in the network during training, so the FP32 weights are quantized to Int8 at each iteration after the weight updates. In some cases, quantization-aware training yields slightly better accuracy.
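As a rough illustration of post-training quantization, the sketch below derives a symmetric quantization factor from the largest observed magnitude and maps FP32 values to Int8. All names here are hypothetical, and real calibration pipelines (such as those in the toolkits mentioned next) use richer statistics than a simple maximum.

```c
#include <math.h>
#include <stdint.h>

/* Symmetric post-training quantization of an FP32 tensor to Int8.
 * Returns the scale so the consumer can dequantize: x ~= q * scale. */
float quantize_int8(const float *w, int8_t *q, int n) {
    /* Collect the calibration statistic: the largest magnitude. */
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > max_abs) max_abs = a;
    }
    /* Quantization factor mapping [-max_abs, max_abs] onto [-127, 127]. */
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        float r = roundf(w[i] / scale);  /* nearest Int8 code */
        if (r > 127.0f)  r = 127.0f;     /* clamp to Int8 range */
        if (r < -128.0f) r = -128.0f;
        q[i] = (int8_t)r;
    }
    return scale;
}
```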
You can realize the performance benefits of VNNI on the 2nd Gen Intel Xeon Scalable processor, together with these quantization techniques, via the Intel® Distribution of OpenVINO™ toolkit or Intel-optimized frameworks such as TensorFlow* and PyTorch*.
Benefits of Vector Neural Network Instructions
With VNNI, low-precision inference is possible using the processors already trusted by so many organizations for so many other tasks, so AI capabilities can be more easily integrated alongside other workloads on versatile, multi-purpose 2nd Gen Intel Xeon Scalable processors. Further, performance can improve significantly for both batch inference and real-time inference, because Vector Neural Network Instructions reduce both the number and complexity of the instructions required for each convolution operation, which in turn reduces the compute power and memory accesses those operations require.
Learn More at O’Reilly AI NYC
If you’re interested in learning more about VNNI, Intel DL Boost, and Intel’s wider, edge-to-cloud technology portfolio for AI, please attend my session Understanding and Integrating Intel Deep Learning Boost on Wednesday, April 17th, at 4:05pm at O’Reilly AI NYC. Please also stay tuned to intel.ai and follow along on Twitter at @IntelAI.
Acknowledgements: Akhilesh Kumar, Nagib Hakim, Vikram Saletore, Andres Rodriguez, Evarist Fomenko, Indu Kalyanaraman, Ramesh AG, Emily Hutson
¹ 14x inference throughput improvement on Intel® Xeon® Platinum 8280 processor with Intel DL Boost. For details, see https://www.intel.ai/2ndgenxeonscalable/.
Notices and Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
Performance results are based on testing or projections as of 7/11/2017 to 4/1/2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice (Notice Revision #20110804).
The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel, the Intel logo, OpenVINO, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others. © Intel Corporation.