Accelerating INT8 Inference Performance for Recommender Systems

This blog describes how to quantize the model weights and activations and use the lower numerical precision functions available in Intel DNNL to efficiently accelerate the performance of the Wide and Deep learning recommender model using Intel DL Boost.

Most inference applications today require low latency, high memory bandwidth, and large compute capacity. With the increasing use and growing memory footprint of recommender systems, which make up 50-60% of all inference workloads in the data center [1], [2], these requirements are expected to become even more demanding. Intel® Xeon® Scalable processors continue to offer strong inference value for recommendation systems, especially for sparse models with large memory footprints that cannot fit into an accelerator. Recently, Intel researchers demonstrated that deep learning inference can be performed with lower numerical precision, using 8-bit multipliers with minimal to no loss in accuracy. There are two main benefits of lower numerical precision. First, many operations are memory bandwidth-bound, so reducing precision enables better cache usage and eases bandwidth bottlenecks. Second, the hardware can deliver more operations per second (OPS) at lower numerical precision, since these multipliers require less silicon area and power.
In this article, we describe INT8 data type acceleration using Intel® Deep Learning Boost (Intel® DL Boost), available in 2nd Generation Intel® Xeon® Scalable processors, the only microprocessor with built-in AI inference acceleration. The 2nd Gen Intel Xeon Scalable processor family includes the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set with 512-bit wide Fused Multiply Add (FMA) core instructions. These instructions enable lower numerical precision multiplies with higher precision accumulates. These specialized high-performing instructions provide embedded acceleration via Intel DL Boost to speed up low-precision inference. Further, Intel provides optimized software support with libraries such as the Intel® Deep Neural Network Library (Intel® DNNL) that take direct advantage of such CPU features.
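To make the arithmetic behind these instructions concrete, the short sketch below mimics in NumPy the per-lane behavior of an AVX-512 VNNI style fused multiply-add such as VPDPBUSD: four unsigned 8-bit activations are multiplied by four signed 8-bit weights, the products are summed, and the sum is accumulated into a 32-bit integer. This is an illustrative model of the instruction's arithmetic only, not actual intrinsics or Intel DNNL code.

```python
# Illustrative sketch (not actual intrinsics): the per-lane arithmetic of an
# Intel DL Boost (AVX-512 VNNI) style fused multiply-add such as VPDPBUSD.
import numpy as np

def vnni_lane(acc_s32: np.int32, a_u8: np.ndarray, w_s8: np.ndarray) -> np.int32:
    """One 32-bit lane: acc += sum(a[i] * w[i] for i in range(4))."""
    assert a_u8.shape == (4,) and w_s8.shape == (4,)
    # Multiply 8-bit values as 32-bit integers, sum the four products,
    # and accumulate into the 32-bit result (higher precision accumulate).
    products = a_u8.astype(np.int32) * w_s8.astype(np.int32)
    return np.int32(acc_s32 + products.sum())

# Example: accumulate one group of four INT8 multiplies into an INT32 result.
acc = np.int32(0)
acc = vnni_lane(acc,
                np.array([10, 200, 3, 45], dtype=np.uint8),
                np.array([-5, 7, 127, -120], dtype=np.int8))
print(acc)  # -3669
```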
We also describe how to quantize the model weights and activations and use the lower numerical precision functions available in Intel DNNL to efficiently accelerate the performance of the Wide and Deep learning recommender model [3] using Intel DL Boost. The embedding lookup portion of the model, which typically has a high memory footprint, takes advantage of the high memory bandwidth and capacity available in Intel Xeon Scalable processors, while the compute-intensive neural network portion (the fully connected layers) benefits from the accelerated low-precision (INT8) performance provided by Intel DL Boost. We describe how the model can be optimized for the best performance on the dataset under consideration. Further, Intel DNNL provides general matrix multiply (GEMM) functions that take INT8 input values and INT8 weight values, perform the matrix multiplication, and output INT32 results. We explain how the fully connected layers of the deep portion of the Wide and Deep model are quantized to use these DNNL functions for accelerated inference performance.
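The sketch below illustrates one simple way such a quantized fully connected layer can be expressed: weights and activations are mapped to INT8 with per-tensor scales, the matrix multiply accumulates into INT32, and the result is dequantized back to FP32. The function names and the symmetric, per-tensor scaling scheme are illustrative assumptions chosen for clarity; they are not the actual Intel DNNL API or the exact quantization recipe used in the optimized model.

```python
# A minimal sketch of INT8 quantization for a fully connected layer with
# INT32 accumulation (illustrative only; not the Intel DNNL API).
import numpy as np

def quantize_symmetric(x_fp32: np.ndarray, num_bits: int = 8):
    """Map FP32 values to signed INT8 using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    scale = float(np.abs(x_fp32).max()) / qmax
    scale = scale if scale > 0 else 1.0                  # avoid divide-by-zero
    x_int8 = np.clip(np.round(x_fp32 / scale), -qmax - 1, qmax).astype(np.int8)
    return x_int8, scale

def int8_fully_connected(a_fp32, w_fp32, bias_fp32):
    """Quantize activations and weights, multiply in INT8, accumulate in INT32,
    then dequantize the result back to FP32."""
    a_int8, a_scale = quantize_symmetric(a_fp32)
    w_int8, w_scale = quantize_symmetric(w_fp32)
    acc_int32 = a_int8.astype(np.int32) @ w_int8.astype(np.int32)  # INT32 accumulate
    return acc_int32 * (a_scale * w_scale) + bias_fp32             # dequantize

# Example: batch of 4 samples, 16 inputs, 8 output units.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)
b = np.zeros(8, dtype=np.float32)
print(np.abs(int8_fully_connected(a, w, b) - (a @ w + b)).max())  # small quantization error
```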
We show that Intel DL Boost provides a 2x inference performance improvement with INT8 compared to FP32 precision, while keeping the accuracy loss below 0.5% [4]. This is demonstrated on popular machine learning frameworks such as TensorFlow and MXNet for the low batch size use cases that are typical of recommender systems. For more complete details on how we achieved this performance improvement, please read the complete article. Follow us on Twitter for more updates from our AI research team.
[1] Norman P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, 2017. https://arxiv.org/abs/1704.04760.

[2] Jongsoo Park et al., Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, 2018. https://arxiv.org/abs/1811.09886.

[3] Heng-Tze Cheng et al., Wide & Deep Learning for Recommender Systems, 2016. https://arxiv.org/abs/1606.07792.

[4] Configuration: Intel Xeon Platinum 8280L processor, 28 cores, HT on, Turbo on, total memory 384 GB (12 slots / 32 GB / 2933 MHz), BIOS: SE5C620.86B.0D.01.0348.011820191451 (ucode: 0x5000017), CentOS 7.6, kernel 4.19.5-1.el7.elrepo.x86_64, SSD 1x Intel SSDSC2KG96 960 GB, compiler gcc 6.3.1. Deep learning frameworks: TensorFlow on GitHub at tensorflow/tensorflow with Pull Requests PR26169, PR26261, and PR26271 applied, MKL-DNN version v0.18, Wide & Deep on GitHub at IntelAI/models, models: FP32 pretrained model and INT8 pretrained model; MXNet on GitHub at apache/incubator-mxnet with patch applied, MKL-DNN on GitHub at intel/mkl-dnn, Wide & Deep on GitHub at intel/optimized-models. Dataset: Criteo Display Advertisement Challenge, batch size = 512 (with 28 concurrent batches in TensorFlow). Inference performance is measured in data samples processed per second (higher is better).

Notices & Disclaimers

Performance results are based on internal testing as of March 1, 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
