# Accelerate INT8 Inference Performance for Recommender Systems with Intel® Deep Learning Boost (Intel® DL Boost)

Published: 10/16/2019

Last Updated: 10/16/2019

By Zhenlin Luo, Andres Felipe Rodriguez Perez, Pallavi G, Gomathi Ramamurthy, Sneha Kola, Evarist M Fomenko, Rajesh Poornachandran, Ling Yan Guo, Karan Puttannaiah, Niveditha Sundaram, and Denis Samoilov

## Introduction

With the massive growth of online information, recommender systems have become indispensable for tackling the over-choice problem. The Wide & Deep Learning Recommender System (Cheng 2016) is an example deep learning (DL) topology that combines the benefits of feature interactions using *wide* linear models with the generalization of *deep* neural networks.

Traditional deep learning solutions or applications use 32 bits of floating-point precision (FP32) for training and inference. Deep learning inference with 8-bit (INT8) multipliers (accumulated to 32 bits) with minimal loss in accuracy (Norman 2017) is common for various convolutional neural network (CNN) models (Gupta 2015, Lin 2016, Gong 2018). However, results on recommender systems have not previously been available.

The 2nd Generation Intel® Xeon® Scalable processor includes new embedded acceleration instructions known as Intel® Deep Learning Boost (Intel® DL Boost) that use Vector Neural Network Instructions (VNNI) to accelerate low-precision performance. Intel® DL Boost improves throughput and reduces latency with up to four times more compute than FP32. See Lower Numerical Precision Deep Learning Inference and Training for details. Inference with INT8 precision can accelerate computation performance, save memory bandwidth, provide better cache locality, and save power.

Using Intel DL Boost technology, we reported a 2x performance gain with INT8 using the Wide & Deep Recommender System with minimal loss of accuracy (less than 0.5%) from FP32 precision.^{1} We wrote this white paper to educate the industrial, academic, and hobbyist communities on the quantization and optimization techniques we used to accelerate INT8 inference performance using Intel DL Boost and the Deep Neural Network Library (DNNL), formerly called Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).

## Training and Optimization of FP32 Model

### Training

We used the Kaggle* Display Advertising Challenge Dataset from Criteo* AI Labs. The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row of data corresponds to a display ad served by Criteo, and the label column indicates whether the ad has been clicked. There are 13 columns of numerical features represented as integers and 26 columns of categorical features represented as hexadecimal values. A total of 8,000,000 rows are used for training the model.

The numerical features are fed directly into the deep part after normalization (in the range [0,1]), while the categorical features are first hashed and fed into both wide and deep parts for embedding.

A one-hot vector of length 1000 is used for hashing each of the categorical features. The wide part learns sparse interactions between features, effectively using linear transformation, while the deep part generalizes the feature combinations using a feed-forward neural network with embedding layers and multiple hidden layers.

Figure 1. The numerical features are fed directly into the deep part after normalization, while the categorical features are first hashed (represented by a string to hash bucket block) and fed into both wide and deep parts for embedding.

Each of the 26 categorical features is first hashed into one of 1000 buckets and represented as a one-hot vector of length 1000. Embedding is done on these sparse vectors to obtain a dense vector of length 32. The 13 numerical features are directly concatenated together. The hidden layers of the multi-layer perceptron (MLP) are chosen to be of size 1024, 512, and 256, respectively.
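The feature pipeline described above can be sketched in NumPy. This is a minimal illustration, not the trained model: the CRC32 hash, the random embedding tables, and the helper names (`hash_bucket`, `normalize`, `deep_input`) are assumptions made for the example, and it assumes each categorical column gets its own length-32 embedding.

```python
import zlib
import numpy as np

NUM_NUMERIC = 13       # numerical feature columns
NUM_CATEGORICAL = 26   # categorical feature columns
HASH_BUCKETS = 1000    # hash bucket size per categorical column
EMBED_DIM = 32         # dense embedding length (assumed per column)

def hash_bucket(value: str, buckets: int = HASH_BUCKETS) -> int:
    # CRC32 is only a stand-in; the frameworks use their own hash functions.
    return zlib.crc32(value.encode("utf-8")) % buckets

def normalize(x, lo, hi):
    # Min-max scale each numerical feature into [0, 1].
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.clip((x - lo) / span, 0.0, 1.0)

rng = np.random.default_rng(0)
# One (hypothetical, randomly initialized) embedding table per column.
tables = rng.standard_normal(
    (NUM_CATEGORICAL, HASH_BUCKETS, EMBED_DIM)).astype(np.float32)

def deep_input(numeric, categorical, lo, hi):
    ids = np.array([hash_bucket(v) for v in categorical])
    # Picking row ids[i] of table i is the same as one_hot(ids[i]) @ tables[i].
    embedded = tables[np.arange(NUM_CATEGORICAL), ids]   # shape (26, 32)
    return np.concatenate([normalize(numeric, lo, hi), embedded.ravel()])
```

Under these assumptions the input to the first hidden layer has width 13 + 26 × 32 = 845.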

Wide & Deep recommender systems are characterized by memory bandwidth-intensive operations, namely embedding. Moreover, since Intel DL Boost is used to accelerate the Fused Multiply Add (FMA), the overall performance improvement via Intel DL Boost depends on minimizing the execution time on non-FMA graph nodes (OPs). To fully leverage the inference accelerations provided by Intel DL Boost, the higher precision (FP32) Wide & Deep graph is first optimized.

The OPs within the trained model can vary depending on the framework being used. TensorFlow* Estimator provides a high-level API called DNNLinearCombinedClassifier for generating a Wide & Deep model. In Apache MXNet* (Chen 2015), we need to manually add the OPs needed to build this model. Here, we describe the process involved in the optimization of pre-trained models using existing tools in the TensorFlow and MXNet frameworks.

### Optimization

The model trained in TensorFlow is optimized as follows:

- Training OPs that are not needed for inference are pruned using Graph Transform Tool, which provides a suite of tools to modify the model. We use **strip_unused_nodes**, **remove_nodes**, and **remove_attribute** to prune training OPs.
- Categorical columns are optimized by removing redundant and unnecessary OPs. The left portion of Figure 2 contains the unoptimized portion of the graph. These are optimized as described below:
  - The Expand Dimension, GatherNd, NotEqual, and Where OPs that are used to get a non-empty input string of the required dimension are removed as they are redundant for the current dataset.
  - Error checking and handling OPs (NotEqual, GreaterEqual, SparseFillEmptyRows, Unique, etc.) and unique value calculation and reconstruction OPs (Unique, SparseSegmentSum/Mean, StridedSlice, Pack, Tile, etc.) are removed as they are not necessary for the current dataset.

- Categorical column and embedding OPs are fused. The middle portion of Figure 2 contains unfused GatherV2 OPs. These are fused as described below:
  - 26 categorical columns are fused to use a single string to hash bucket lookup.
  - 26 embedding calls using GatherV2 OPs in the deep part and 26 of the same in the wide part are correspondingly fused into 2 GatherNd OPs.

- A separate data pre-processing session is used for string to hash bucket lookup for categorical columns and normalization for numerical columns.

Figure 2. Feature Column Optimization in TensorFlow. OPs that are not necessary for the dataset considered are first removed. Unfused GatherV2 are then combined to form Fused Gather for optimal performance.
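To illustrate why fusing the per-column embedding lookups is safe, the following NumPy sketch compares 26 separate gathers against one gather over a stacked table. It is an analogy to the GatherV2-to-GatherNd fusion, not the TensorFlow graph rewrite itself; the table contents and function names are assumptions made for the example.

```python
import numpy as np

NUM_COLUMNS, HASH_BUCKETS, EMBED_DIM = 26, 1000, 32
rng = np.random.default_rng(0)
# Hypothetical embedding tables, one per categorical column.
tables = [rng.standard_normal((HASH_BUCKETS, EMBED_DIM)).astype(np.float32)
          for _ in range(NUM_COLUMNS)]

def unfused_lookup(ids):
    # ids has shape (batch, 26): one gather per column, then a concat.
    return np.concatenate(
        [tables[c][ids[:, c]] for c in range(NUM_COLUMNS)], axis=1)

# Stack the tables once so a single gather can address (column, bucket) pairs.
fused_table = np.stack(tables)                  # shape (26, 1000, 32)

def fused_lookup(ids):
    cols = np.arange(NUM_COLUMNS)               # broadcasts against the batch
    gathered = fused_table[cols, ids]           # shape (batch, 26, 32)
    return gathered.reshape(ids.shape[0], -1)

ids = rng.integers(0, HASH_BUCKETS, size=(4, NUM_COLUMNS))
print(np.array_equal(unfused_lookup(ids), fused_lookup(ids)))
```

Both paths return the same values; the fused version replaces 26 small memory-bound OPs with one indexed read.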

The model trained in MXNet is optimized as follows:

- 26 Embedding calls using SparseEmbedding OPs in the deep part are fused into a single **ParallelEmbedding** call.
- A single Dot OP is used for linear transformation in the wide part.
- Memory manipulation OPs like slice and split, used to divide input features, are fused together.
- A separate data pre-processing session is used for string to hash bucket lookup for categorical columns and normalization for numerical columns.

The optimized FP32 model can be visualized as shown in Figure 3.

Figure 3. Visualization of Optimized FP32 Model. The numerical features are fed directly into the deep part, while categorical features are first hashed and fed into both wide and deep parts for embedding. The deep part has compute-intensive Fully Connected layers, whereas the wide part has a relatively less compute-intensive linear transformation (dot product in this case). The outputs from these parts are then passed through a Softmax activation to produce the final output.

As part of future work, we plan to augment existing framework tools for more efficient graph optimization.

## Quantization

DNNL supports general matrix multiply functions, which can take INT8 input values and INT8 weight values to do matrix multiplication and output INT32 results. Further, bias, ReLU, requantization, and dequantization OPs of the fully connected layers can be fused. In this work, all fully connected (FC) layers are quantized to INT8 precision.

The quantization process followed in the TensorFlow and MXNet frameworks can be visualized in Figure 4. The FP32 model is first converted to a fused INT8 model. This involves quantizing the weights to INT8 precision, and then replacing FP32 OPs with fused INT8 OPs. For example, MatMul, BiasAdd, and ReLU are fused to form a single quantized OP. Tensor stats data (min, max, range) are collected using a calibration dataset, which is a training dataset subset. In TensorFlow, calibration is done using the quantized model, whereas in MXNet it is done using the FP32 model. The requantize OP is then fused with the quantized fully connected OPs of the corresponding layer. This is explained in more detail in Figure 5.

Figure 4. Quantization Process. The FP32 OPs are fused and converted to INT8 OPs. In TensorFlow, calibration is done using the quantized model, whereas in MXNet it is done using the FP32 model.
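The calibration step can be pictured as a running min/max collector over batches of a tensor. This is a minimal sketch assuming simple min/max statistics; the class and function names are hypothetical, and the calibration tools in TensorFlow and MXNet may collect statistics differently.

```python
import numpy as np

class MinMaxCollector:
    """Tracks the running min, max, and range of one tensor."""

    def __init__(self):
        self.min = np.inf
        self.max = -np.inf

    def update(self, tensor):
        # Fold the statistics of one calibration batch into the running values.
        self.min = min(self.min, float(np.min(tensor)))
        self.max = max(self.max, float(np.max(tensor)))

    @property
    def range(self):
        return self.max - self.min

def calibrate(batches):
    # batches: an iterable of arrays observed at the input of one FC layer.
    stats = MinMaxCollector()
    for batch in batches:
        stats.update(batch)
    return stats

stats = calibrate([np.array([-1.0, 2.0]), np.array([0.5, 3.0])])
print(stats.min, stats.max, stats.range)
```

The collected values are what the quantization factors for each layer's input are derived from.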

During quantization of the FC layers, each of the first FC layers, processed with the OPs MatMul, BiasAdd, and ReLU, is converted to a single fused INT8 OP. The last FC layer, processed with the OPs MatMul and BiasAdd, is converted to another fused INT8 OP. The weights in FP32 precision are quantized (represented by QWeight in the figure), and the FP32 OPs (MatMul, BiasAdd, ReLU) are replaced by corresponding quantized OPs that are fused to form a single OP. This process is shown by the quantization and fusion OPs in Figure 5. Next, the requantize OP from each FC layer is fused with the fused INT8 OP of the corresponding layer, represented by requantize OP fusion in Figure 5.

Figure 5. Quantization and OPs fusion of fully connected layers. The first FC layers processed with the OPs MatMul, BiasAdd and ReLU are converted to a single fused INT8 OP corresponding to each layer. The last FC layer processed with the OPs MatMul and BiasAdd is converted to another fused INT8 OP. Next, the Requantize OP from each FC layer is fused with other fused INT8 OPs for the corresponding layers.
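As a rough picture of what one fused INT8 OP computes, the sketch below chains the INT32-accumulating matrix multiply, bias add, ReLU, and requantization to unsigned 8-bit in a single function. It is written in NumPy for readability and assumes a non-negative input with zero offset; the function name and the scale parameters `q_in`, `q_w`, and `q_out` are illustrative, not the DNNL API.

```python
import numpy as np

def fused_int8_fc(a_u8, w_s8, b_s32, q_in, q_w, q_out, relu=True):
    # MatMul on 8-bit operands, accumulated in 32-bit integers.
    acc = w_s8.astype(np.int32) @ a_u8.astype(np.int32) + b_s32
    if relu:
        acc = np.maximum(acc, 0)
    # Requantize: the accumulator carries scale q_in * q_w, the output q_out.
    scaled = np.round(acc * (q_out / (q_in * q_w)))
    return np.clip(scaled, 0, 255).astype(np.uint8)

rng = np.random.default_rng(1)
a = rng.uniform(0.0, 4.0, size=64)       # non-negative input (after a ReLU)
w = rng.standard_normal((16, 64))
w[0] = np.abs(w[0])                      # guarantees one positive output
b = rng.standard_normal(16)
ref = np.maximum(w @ a + b, 0.0)         # FP reference output

q_in, q_w, q_out = 255 / a.max(), 127 / np.abs(w).max(), 255 / ref.max()
out = fused_int8_fc(np.round(q_in * a).astype(np.uint8),
                    np.round(q_w * w).astype(np.int8),
                    np.round(q_in * q_w * b).astype(np.int32),
                    q_in, q_w, q_out)
print(np.max(np.abs(out / q_out - ref)))  # dequantized error vs. reference
```

Because the output is already unsigned 8-bit, it can feed the next fused FC OP without a round trip through FP32.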

The first fully connected layer can have negative input values, whereas the subsequent fully connected layers have only non-negative input values because a ReLU activation precedes each of them. Therefore, to get the best accuracy, we use a different quantization algorithm for the first FC layer than for the remaining FC layers. Because the input data distributions in recommender systems are not naturally symmetric, unlike image or voice data, we found that asymmetric quantization of such input data achieves lower accuracy loss than symmetric quantization.

Hence, we use 8-bit asymmetric quantization in the first FC as follows:

- If $A$ is the input, $W$ is the weight, and $B$ is the bias, then the quantization factor for input data to the FC is $Q_a = 255/(\max(A_{f32}) - \min(A_{f32}))$, so that the quantized data is $A_{u8} = \mathrm{round}(Q_a (A_{f32} - \min(A_{f32})))$.
- The quantization factor for weights is $Q_w = 127/\max(|W_{f32}|)$, so that the quantized weight is $W_{s8} = \mathrm{round}(Q_w W_{f32})$.
- The subscripts "f32", "s32", "s8", and "u8" represent FP32, signed INT32, signed INT8, and unsigned INT8 precisions, respectively.
- To accommodate the quantization done on input data and weights, the shifted bias becomes $B'_{s32} = Q_a Q_w B_{f32} + Q_a \min(A_{f32}) W_{s8}$.
- The output from the first FC is then $X_{s32} = W_{s8} A_{u8} + B'_{s32} \approx Q_a Q_w W_{f32} (A_{f32} - \min(A_{f32})) + Q_a Q_w B_{f32} + Q_a Q_w \min(A_{f32}) W_{f32} = Q_a Q_w (W_{f32} A_{f32} + B_{f32}) = Q_a Q_w X_{f32}$, where $X_{f32}$ is the original output of the first FC in FP32 precision.
- For FC layers after the first layer, 8-bit symmetric quantization of input data is performed. The quantization factor for input to these FC layers becomes $255/\max(A_{f32})$.
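The asymmetric scheme for the first FC layer can be checked numerically. The NumPy sketch below follows the factors defined in the list above and reads the bias term that multiplies min(A) by the quantized weights as min(A) times the row sums of the quantized weight matrix; that reading, the function names, and the random test data are assumptions made for this example.

```python
import numpy as np

def quantize_first_fc(a_f32, w_f32, b_f32):
    a_min, a_max = float(a_f32.min()), float(a_f32.max())
    q_a = 255.0 / (a_max - a_min)                  # input factor
    q_w = 127.0 / float(np.abs(w_f32).max())       # weight factor
    a_u8 = np.clip(np.round(q_a * (a_f32 - a_min)), 0, 255).astype(np.uint8)
    w_s8 = np.round(q_w * w_f32).astype(np.int8)
    # Shifted bias: restores the min(A) offset removed from the input.
    row_sums = w_s8.astype(np.int32).sum(axis=1)
    b_shift = np.round(q_a * q_w * b_f32
                       + q_a * a_min * row_sums).astype(np.int32)
    return a_u8, w_s8, b_shift, q_a, q_w

def first_fc_int8(a_u8, w_s8, b_shift):
    # 8-bit operands, 32-bit accumulation.
    return w_s8.astype(np.int32) @ a_u8.astype(np.int32) + b_shift

rng = np.random.default_rng(0)
a = rng.uniform(-3.0, 5.0, size=64)      # input with negative values
w = rng.standard_normal((16, 64))
b = rng.standard_normal(16)

a_u8, w_s8, b_shift, q_a, q_w = quantize_first_fc(a, w, b)
x_s32 = first_fc_int8(a_u8, w_s8, b_shift)
x_f32 = w @ a + b
# Dividing by Q_a * Q_w should recover the FP32 output up to rounding error.
print(np.max(np.abs(x_s32 / (q_a * q_w) - x_f32)))
```

The residual printed at the end comes only from rounding the inputs, weights, and bias to integers.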

As part of future work, to efficiently scale the process of quantization, we are developing scalable automatic quantization tools to quantize the model depending on accuracy and performance requirements.

## Inference Performance

Using Intel DL Boost technology with low-precision INT8 inference, we obtain improved latency and throughput. Figure 6 shows a 2x gain in records processed per second with INT8 inference compared to FP32, with an accuracy loss of less than 0.5%.

Figure 6. Inference performance with Intel DL Boost. Inference is done on an evaluation dataset of 2,000,000 samples using an optimized FP32 model. The performance in TensorFlow improves from 562,780 samples per second to 1,210,949 samples per second with batch size 512. The performance in MXNet improves from 522,284 samples per second to 1,059,480 samples per second with batch size 1024.^{1}
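The speedups implied by the throughput figures in the caption can be reproduced with a few lines of arithmetic. The numbers are taken from Figure 6; the dictionary layout is just a convenience for this example.

```python
# Samples per second from Figure 6: (FP32, INT8).
throughput = {
    "TensorFlow": (562_780, 1_210_949),
    "MXNet": (522_284, 1_059_480),
}

speedups = {name: int8 / fp32 for name, (fp32, int8) in throughput.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# TensorFlow: 2.15x
# MXNet: 2.03x
```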

### Execution

- The steps to run inference with TensorFlow using pre-trained FP32 and quantized INT8 Wide & Deep models can be found on GitHub* at IntelAI/models.
- The steps to do FP32 training and inference on FP32 and INT8 models with MXNet can be found on GitHub at intel/optimized-models.

### Configuration Details

The results were obtained with:

- Intel® Xeon® Platinum 8280L Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0348.011820191451 (ucode:0x5000017), CentOS 7.6, Kernel 4.19.5-1.el7.elrepo.x86_64, SSD 1x INTEL SSDSC2KG96 960GB, Compiler gcc 6.3.1
- Deep Learning Framework: TensorFlow on GitHub at tensorflow/tensorflow applying Pull Request PR26169, Pull Request PR26261 and Pull Request PR26271, MKL-DNN version: v0.18, Wide & Deep on GitHub at IntelAI/models, Models: FP32 pretrained model (9 MB, PB) and INT8 pretrained model (5 MB, PB)
- MXNet on GitHub at apache/incubator-mxnet applying patch; MKL-DNN on GitHub at intel/mkl-dnn
- Wide & Deep on GitHub at intel/optimized-models
- Dataset: Criteo* Display Advertisement Challenge, Batch Size=512 (with 28 concurrent batches in TensorFlow)

Inference performance is measured in data samples processed per second (higher is better).

## Summary

In this article, we described how the Wide & Deep recommender system model can be quantized and optimized to efficiently accelerate inference performance using Intel DL Boost. We showed that Intel DL Boost provides a 2x inference performance improvement with INT8 compared to FP32 precision, while maintaining accuracy loss below 0.5%.

## Footnotes

1. Performance results are based on internal testing as of March 1, 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.

Testing by Intel as of March 1, 2019. Intel® Xeon® Platinum 8280L Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0348.011820191451 (ucode:0x5000017), CentOS 7.6, Kernel 4.19.5-1.el7.elrepo.x86_64, SSD 1x INTEL SSDSC2KG96 960GB, Compiler gcc 6.3.1; Deep Learning Framework: TensorFlow on GitHub at tensorflow/tensorflow applying Pull Request PR26169, Pull Request PR26261 and Pull Request PR26271, MKL-DNN version: v0.18, Wide & Deep on GitHub at IntelAI/models, Models: FP32 pretrained model and INT8 pretrained model; MXNet on GitHub at apache/incubator-mxnet applying patch; MKL-DNN on GitHub at intel/mkl-dnn; Wide & Deep on GitHub at intel/optimized-models; Dataset: Criteo Display Advertisement Challenge, Batch Size=512 (with 28 concurrent batches in TensorFlow).

© 2019 Intel Corporation.
