**WHITE PAPER** 



# LOWER NUMERICAL PRECISION DEEP LEARNING INFERENCE AND TRAINING

## Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko, Young Jim Kim, Haihao Shen, and Barukh Ziv

#### January 2018

#### Introduction

Most commercial deep learning applications today use 32-bits of floating point precision (fp32) for training and inference workloads. Various researchers have demonstrated that both deep learning training and inference can be performed with lower numerical precision, using 16-bit multipliers for training and 8-bit multipliers or fewer for inference with minimal to no loss in accuracy (higher precision – 16-bits vs. 8-bits – is usually needed during training to accurately represent the gradients during the backpropagation phase). Using these lower numerical precisions (training with 16-bit multipliers accumulated to 32-bits or more and inference with 8-bit multipliers accumulated to 32-bits) will likely become the standard over the next year, in particular for convolutional neural networks (CNNs).

There are two main benefits of lower numerical precision. First, many operations are memory bandwidth bound, and reducing precision would allow for better usage of cache and reduction of bandwidth bottlenecks. Thus, data can be moved faster through the memory hierarchy to maximize compute resources. Second, the hardware may enable higher operations per second (OPS) at lower numerical precision as these multipliers require less silicon area and power.

In this article, we review the history of lower numerical precision training and inference and describe how Intel® is enabling lower numerical precision for deep learning inference and training. Specifically, we describe instructions available in the current Intel® Xeon® Scalable processors and instructions that will be available in some future processors. We describe how to quantize the model weights and activations and the lower numerical functions available in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). Finally, we describe how deep learning frameworks take advantage of these lower numerical precision functions and reduce the conversion overhead between different numerical precisions. Each section can be read independently of other sections and the reader may skip to their section of interest. A detailed exposition including commercial examples of deep learning training with the Intel Xeon Scalable processors is presented <u>elsewhere</u>.

#### Brief History of Lower Numerical Precision in Deep Learning

Researchers have demonstrated deep learning training with 16-bit multipliers and inference with 8-bit multipliers or less of numerical precision accumulated to higher precision with minimal to no loss in accuracy across various models. Vanhoucke, et al. (2011) quantized activations and weights to 8-bits and kept the biases and first layer input at full precision (fp32) for the task of speech recognition on CPUs. Hwang, et al. (2014) trained a simple network with quantized weights of -1, 0 and 1 in the feed forward propagation and updated the high precision weights in the back propagation using the MNIST\* and TIMIT\* datasets with negligible performance loss. Courbariaux, et al. (2015) used the MNIST, CIFAR-10\*, and SVHN\* datasets to train with lower numerical precision multipliers and high precision accumulators and updated the high precision weights. They proposed combining dynamic fixed point (having one shared exponent for a tensor or high dimensional array) with Gupta, et al.'s (2015) stochastic rounding as future work. This became the core piece of Koster, et al.'s (2017) use of the Flexpoint numerical format in the Intel<sup>®</sup> Nervana<sup>™</sup> Neural Network Processors (Intel Nervana NNP). Kim and Smaragdis (2016) trained with binary weights and updated on full precision weights with competitive performance on the MNIST dataset. Mivashita, et al. (2016) encoded the weights and activations in a base-2 logarithmic representation (since weights/activations have a non-uniform distribution). They trained CIFAR-10 with 5-bits weights and 4-bit activations resulting in minimal performance degradation. Rastegari, et al. (2016) trained AlexNet with binary weights (except for the first and last layers) and updated on full precision weights with a top-1 2.9% accuracy loss. Based on their experiments, they recommend avoiding binarization in fully connected layers and convolutional layers with small channels or filter sizes (e.g., 1x1 kernels). Mellempudi, et al. (2017) from Intel Labs trained ResNet-101 with 4-bit weights and 8-bit activations in convolutional layers while doing updates in full precision with a top-1 2% accuracy loss. Micikevicius, et al. (2017) trained with 16-bit floating-point (fp16) multipliers and full precision accumulators and updated the full precision weights with negligible to no loss in accuracy for AlexNet\*, VGG-D\*, GoogLeNet\*, ResNet-50\*, Faster R-CNN\*, Multibox SSD\*, DeepSpeech2\*, Sequence-to-Sequence\*, bigLSTM\*, and DCGAN\* (some models required gradient scaling to match full precision results). Baidu researchers (2017) successfully used 8-bits of fixed precision with 1 sign bit, 4bits for the integer part and 3-bits for the fractional part. Sze. et al. (2017) used various guantization techniques (see Table 3 in their paper) showing minimal to no loss at reduced precision (except for the first and last layers which were kept at full precision). An anonymous submission to ICLR 2018 details how to generate state-of-the-art on ResNet-50, GoogLeNet, VGG-16, and AlexNet using 16-bits integer multipliers and 32-bit accumulators.

#### Lower numerical precision with Intel Xeon Scalable processors

The Intel Xeon Scalable processor now includes the Intel® Advance Vector Extension 512 (Intel® AVX-512) instruction set which have the <u>512-bit wide Fused Multiply Add (FMA) core</u> instructions. These instructions enable lower numerical precision multiplies with higher

precision accumulates. Multiplying two 8-bit values and accumulating the result to 32-bits requires 3 instructions and requires one of the 8-bit vectors to be in *unsigned int*8 (*u*8) format, the other in *signed int*8 (*s*8) format with the accumulation in *signed int*32 (*s*32) format. This allows for 4x more input at the cost of 3x more instructions or 33.33% more compute with 1/4 the memory requirement. The reduced memory and higher frequency for lower numerical precision operations makes it even faster. See Figure 1 for details<sup>1</sup>.

| 1 VPMADDUBSW<br>vector multiply into 16-bit + horizontal add of neighbors<br>executes on Port 0 in Cycle 0 |               |                                  |                                  |                                |                                  |  | <b>N</b><br>tal add of neighbors<br>Cycle 0                        |    | 2 VPMADDWD<br>effectively upconvert to 32-bit and horizontal add of neighbors<br>executes on Port 5 in Cycle 0 (uses the multiplier to multiply by 1) |                                                 |                                                                                 |                                                                     |                                                                     | ;<br>1) | 3<br>exe        | VPADDD<br>ecutes on Port 0 in Cycle 1 (does not use the multiplier) |  |                                                                              |                                                                       |
|------------------------------------------------------------------------------------------------------------|---------------|----------------------------------|----------------------------------|--------------------------------|----------------------------------|--|--------------------------------------------------------------------|----|-------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------|---------------------------------------------------------------------|---------|-----------------|---------------------------------------------------------------------|--|------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| S                                                                                                          | RC 1<br>B-bit | Ao                               | A <sub>1</sub>                   | Az                             | A <sub>3</sub>                   |  | A <sub>63</sub>                                                    |    | SRC <sup>-</sup><br>16-bi                                                                                                                             | $A_0 * B_0 + A_1$                               | $B_1 A_2 B_2 + A_3 B_3$                                                         |                                                                     | A <sub>62</sub> *B <sub>62</sub> + A <sub>63</sub> *B <sub>63</sub> | _       | 32-bit          | $A_0^*B_0 + A_1^*B_1 + A_2^*B_2 + A_3^*B_3$                         |  | A <sub>60</sub> *B <sub>60</sub> + A <sub>61</sub> *B <sub>61</sub> +        | • A <sub>62</sub> *B <sub>62</sub> + A <sub>63</sub> *B <sub>63</sub> |
| s                                                                                                          | RC 2<br>B-bit | Bo                               | B <sub>1</sub>                   | B <sub>2</sub>                 | B <sub>3</sub>                   |  | B <sub>63</sub>                                                    | 1/ | SRC:                                                                                                                                                  | 1                                               | 1                                                                               |                                                                     | 1                                                                   | //      | SRC 2<br>32-bit | Co                                                                  |  |                                                                              | C <sub>15</sub>                                                       |
| 1                                                                                                          | EST<br>6-bit  | A <sub>o</sub> *B <sub>o</sub> + | • A <sub>1</sub> *B <sub>1</sub> | A <sub>2</sub> *B <sub>2</sub> | + A <sub>3</sub> *B <sub>3</sub> |  | A <sub>62</sub> *B <sub>62</sub> + A <sub>63</sub> *B <sub>6</sub> | ,  | DEST<br>32-bit                                                                                                                                        | A <sub>0</sub> *B <sub>0</sub> + A <sub>1</sub> | <sup>*</sup> B <sub>1</sub> + A <sub>2</sub> *B <sub>2</sub> + A <sub>3</sub> * | B <sub>3</sub> A <sub>60</sub> *B <sub>60</sub> + A <sub>61</sub> * | $B_{61} + A_{62} * B_{62} + A_{63} * B_{63}$                        |         | Dest<br>32-bit  | $A_0^*B_0 + A_1^*B_1 + A_2^*B_2 + A_3^*B_3 + C_0$                   |  | A <sub>60</sub> *B <sub>60</sub> + A <sub>61</sub> *B <sub>61</sub> ·<br>+ ( | + A <sub>62</sub> *B <sub>62</sub> + A <sub>63</sub> *B <sub>63</sub> |
|                                                                                                            |               |                                  |                                  |                                |                                  |  |                                                                    | _  |                                                                                                                                                       |                                                 |                                                                                 |                                                                     |                                                                     |         |                 |                                                                     |  |                                                                              |                                                                       |

Figure 1: The Intel Xeon Scalable processor enables 8-bit multiplies with 32-bit accumulates with 3 instructions: VPMADDUBSW  $u8 \times s8 \rightarrow s16$  multiples, VPMADDWD broadcast1  $s16 \rightarrow s32$ , and VPADDD  $s32 \rightarrow s32$  adds the result to accumulator. This allows for 4x more input over fp32 at the cost of 3x more instructions or 33.33% more compute and 1/4 the memory requirement. The reduced memory and higher frequency available with lower numerical precision makes it even faster. Image credit to Israel Hirsh.

The Intel AVX-512 instructions also enable 16-bit multiplies. Multiplying two 16-bit values and accumulating the result to 32-bits requires 2 instructions (2 cycles) with both 16-bit vectors to be in signed *int*16 (*s*16) format and the accumulation in signed *int*32 (*s*32) format. This allows for 2x more input at the cost of 2x more instructions, resulting in no additional compute. It does, however, reduce the memory requirement and bandwidth bottlenecks, both of which may improve the overall performance. See Figure 2 for details.

<sup>&</sup>lt;sup>1</sup> The raw compute can be calculated as AVX-512-frequency \* number-of-cores \* number-of-FMAs-per-core \* 2operations-per-FMA \* SIMD-vector-length / number-of-bits-in-numerical-format / number-of-instructions. Two 512-bit FMA units located in Ports 0 and 5 computing in parallel per core are available in the Intel Xeon Platinum processors, Intel Xeon Gold processors 6000 series and 5122 available. Other Intel Xeon Scalable processor stock keeping units (SKUs) have one FMA unit per core located in Port 0 (see <u>Chapter 15</u> for details). *fp*32, *int*16, and *int*8 FMAs require 1, 2, and 3 instructions, respectively, with the Intel AVX-512 instructions. The Intel Xeon Platinum 8180 has 28 cores per socket and 2 FMAs per core. The *fp*32 OPS per socket are approximately 1.99-GHz-AVX-512-frequency \* 28-cores \* 2-FMA-units-per-core \* 2-OPS-per-FMA \* 512-bits / 32-bits / 1-instruction = 3.570 *fp*32 TOPS. The *int*8 OPS per socket are approximately 2.17-GHz-AVX-512-frequency \* 28-cores \* 2-FMA-unitsper-core \* 2-OPS-per-FMA \* 512-bits / 8-bits / 3-instruction = 5.185 *int*8 TOPS. The AVX-512 frequencies for multiple SKUs can be found here (these correspond to *fp*64 operations—the frequencies for lower numerical precision are higher and the ones used in the *fp*32 and *int*8 TOPS computations above are estimates). The AVX-512 max turbo-frequency may not be fully sustained when running high OPS workloads.

| (1)<br>ve              | VPMADDWD<br>rector multiply into 32-bit + horizontal add of neighbors<br>executes on Port 0 |                    |                                |                                  |                                  |  |                                                                    | (            | 2               | VPADDD<br>executes on Port 5 (does not use the multiplier)      |                                                                                     |  |                                                                    |
|------------------------|---------------------------------------------------------------------------------------------|--------------------|--------------------------------|----------------------------------|----------------------------------|--|--------------------------------------------------------------------|--------------|-----------------|-----------------------------------------------------------------|-------------------------------------------------------------------------------------|--|--------------------------------------------------------------------|
| <b>SRC 1</b><br>16-bit | A <sub>0</sub>                                                                              |                    | <b>A</b> 1                     | A <sub>2</sub>                   | A <sub>3</sub>                   |  | A <sub>31</sub>                                                    |              | 16-bit          | A <sub>0</sub> *B <sub>0</sub> + A <sub>1</sub> *B <sub>1</sub> | A <sub>2</sub> *B <sub>2</sub> + A <sub>3</sub> *B <sub>3</sub>                     |  | A <sub>30</sub> *B <sub>30</sub> +A <sub>31</sub> *B <sub>31</sub> |
| SRC 2<br>16-bit        | Bo                                                                                          |                    | B <sub>1</sub>                 | B2                               | B <sub>3</sub>                   |  | B <sub>31</sub>                                                    | T            | SRC 2<br>32-bit | C <sub>0</sub>                                                  | <b>C</b> <sub>1</sub>                                                               |  | C <sub>15</sub>                                                    |
| DEST<br>32-bit         | A <sub>0</sub> *B                                                                           | B <sub>0</sub> + / | A <sub>1</sub> *B <sub>1</sub> | A <sub>2</sub> *B <sub>2</sub> + | ⊦ A <sub>3</sub> *B <sub>3</sub> |  | A <sub>30</sub> *B <sub>30</sub> +A <sub>31</sub> *B <sub>31</sub> | $\checkmark$ | Dest<br>32-bit  | $A_0 * B_0 + A_1 * B_1 + C_0$                                   | A <sub>2</sub> *B <sub>2</sub> + A <sub>3</sub> *B <sub>3</sub><br>+ C <sub>1</sub> |  | $A_{30}^*B_{30} + A_{31}^*B_{31} + C_{15}$                         |

Figure 2: The Intel Xeon Scalable processor core is capable of 16-bit multiplies with 32-bit accumulates with 2 instructions: VPMADDWD  $s16 \times s16 \rightarrow s32$  multiples, and VPADDD  $s32 \rightarrow s32$  adds the result to accumulator. This allows for 2x more input over fp32 at the cost of 2x more instructions or no more compute and 1/2 the memory requirement. Image credit to Israel Hirsh.

Intel developed the AVX512\_VNNI (Vector Neural Network Instruction), a new set of Intel AVX-512 instructions to boost DL performance. Ice Lake and other future microarchitectures (see Table 1-1) will have the AVX512\_VNNI instructions. AVX512\_VNNI includes 1) an FMA instruction for 8-bit multiplies with 32-bits accumulates  $u8 \times s8 \rightarrow s32$  as shown in Figure 3, and 2) an FMA instruction for 16-bit multiplies with 32-bit accumulates  $s16 \times s16 \rightarrow s32$  as shown in Figure 4. The theoretical peak compute gains are 4x int8 OPS and 2x int16 OPS over fp32 OPS, respectively. Practically, the gains may be lower due to memory bandwidth bottlenecks.

| VPDPBUSD<br>executes on both Port 0 and Port 5 in 1 cycle |                                |                                       |                                                      |                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |   |                 |                 |  |  |  |
|-----------------------------------------------------------|--------------------------------|---------------------------------------|------------------------------------------------------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|-----------------|-----------------|--|--|--|
| SRC 1<br>8-bit                                            | A <sub>0</sub>                 | <b>A</b> 1                            | A <sub>2</sub> A <sub>3</sub>                        |                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |   |                 | A <sub>63</sub> |  |  |  |
| SRC 2<br>8-bit                                            | B <sub>0</sub>                 | B1                                    | B <sub>2</sub>                                       | B <sub>3</sub> |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |   |                 |                 |  |  |  |
| SRC3/                                                     |                                | (                                     | 20                                                   |                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | • | C <sub>15</sub> | C <sub>15</sub> |  |  |  |
| 32-bit                                                    | A <sub>0</sub> *B <sub>0</sub> | + A <sub>1</sub> *B <sub>1</sub><br>+ | + A <sub>2</sub> *B <sub>2</sub> +<br>C <sub>0</sub> | + A₃*B₃        | $\begin{array}{c c} \cdot \cdot \cdot \\ A_{60}^{*}B_{60} + A_{61}^{*}B_{61} + A_{62}^{*}B_{62} + A_{63}^{*}B_{62} + A_{63}^{*}B_{62} + A_{63}^{*}B_{63} + C_{15}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63}^{*}B_{63$ |   |                 |                 |  |  |  |

Figure 3: AVX512\_VNNI enables 8-bit multiplies with 32-bit accumulates with 1 instruction. The VPMADDUBSW, VPMADDWD, VPADDD instructions in Figure 1 are fused into the VPDPBUSD instruction  $u8 \times s8 \rightarrow s32$ . This allows for 4x more inputs over fp32 and (theoretical peak) 4x more compute with 1/4 the memory requirements. Image credit to Israel Hirsh.

| ISA Fusi        | VPDPWSSD<br>executes on both Port 0 and Port 5 in 1 cycle |                                                    |                                           |                |  |                                                           |                                |  |  |  |  |
|-----------------|-----------------------------------------------------------|----------------------------------------------------|-------------------------------------------|----------------|--|-----------------------------------------------------------|--------------------------------|--|--|--|--|
| SRC 1<br>16-bit | A <sub>0</sub>                                            | <b>A</b> 1                                         | A <sub>2</sub>                            | A <sub>3</sub> |  |                                                           | <b>A</b> <sub>31</sub>         |  |  |  |  |
| SRC 2<br>16-bit | B <sub>0</sub>                                            | B <sub>1</sub>                                     | B2                                        | B <sub>3</sub> |  |                                                           | B <sub>31</sub>                |  |  |  |  |
| SRC3/<br>DEST   | C <sub>0</sub>                                            |                                                    | <b>C</b> <sub>1</sub>                     |                |  | C <sub>15</sub>                                           |                                |  |  |  |  |
| 32-bit          | A <sub>0</sub> *B <sub>0</sub> +                          | + A <sub>1</sub> *B <sub>1</sub><br>C <sub>0</sub> | $B_1 = A_2 B_2 + A_3 B_2 + A_3 B_1 + C_1$ |                |  | A <sub>30</sub> *B <sub>30</sub> + A<br>+ C <sub>15</sub> | <sub>31</sub> *B <sub>31</sub> |  |  |  |  |

Figure 4: AVX512\_VNNI enables 16-bit multiplies with 32-bit accumulates with 1 instruction. The VPMADDWD, VPADDD instructions in Figure 2 are fused into the VPDPWSSD instruction  $s16 \times s16 \rightarrow s32$ . This allows for 2x more inputs over fp32 and (theoretical peak) 2x more compute with 1/2 the memory requirements. Image credit to Israel Hirsh.

A potential issue is the undefined behavior on overflows that may occur when using the VPMADDUBSW instruction  $u8 \times s8 \rightarrow s16$  (see Figure 1). This is a problem when both u8 and s8 values are near their maximum values<sup>2</sup>. This can be mitigated by reducing the precision of the inputs by 1-bit. This is not in practice an issue when using the AVX512\_VNNI VPDPBUSD FMA instruction  $u8 \times s8 \rightarrow s32$ .

An overflow is more likely to occur with the AVX512\_VNNI VPDPWSSD FMA instruction  $s16 \times s16 \rightarrow s32$ . This can be similarly mitigated by reducing the precision of the activations and the weights by 1 or 2 bits. Another technique to prevent overflow is to use a second accumulator at fp32, and convert to fp32 and use that accumulator after a set number of s32 accumulates. Preliminary results show that statistical performance does not suffer using these techniques.

Compiler support for these AVX512\_VNNI instructions is underway. <u>GCC 8</u> development code and <u>LLVM/Clang 6.0</u> compiler already support AVX512\_VNNI instructions. The <u>X86 Encoder</u> <u>Decoder (XED)</u> and the <u>Intel software developer emulator (SDE)</u> October 2017 update adds support for AVX512\_VNNI instructions.

## Intel MKL-DNN Library Lower Numerical Precision Primitives

The Intel MKL-DNN library contains popular deep learning functions or primitives used across various models such as inner products, convolutions, rectified linear units (ReLU), and batch normalization (BN), along with functions necessary to manipulate the layout of tensors or high dimensional arrays. Intel MKL-DNN is optimized for Intel processors with Intel AVX-512, Intel<sup>®</sup> AVX-2, and Intel<sup>®</sup> Streaming SIMD Extensions 4.2 (Intel<sup>®</sup> SSE4.2) instructions. These functions use fp32 for training and inference workloads. Recently, new functions were introduced to

 $<sup>^2</sup>$  in practice these u8 values are usually closer to their minimum than their maximum if they activations are preceded by the ReLU activation function

support inference workloads with 8-bits of precision in convolutional, ReLU, fused convolutional plus ReLU, and pooling layers. Functions for long short-term memory (LSTM), other fused operations, and Winograd convolutions with 8-bits are designated as future work. Intel MKL-DNN support for 16-bit multiplies using the AVX512\_VNNI instructions is designed as future work when the instructions become available.

Currently, Intel MKL-DNN does not have a local response normalization (LRN), fully connected (FC), softmax, or batch normalization (BN) layers implemented with 8-bits of precision (only with fp32) for the following reasons. Modern models do not use LRN and older models can be modified to use batch normalization, instead. Modern CNN models do not typically have many FC layers, although adding support for FC layers is designated as future work. The softmax function currently requires full precision as it does not maintain accuracy with 8-bits of precision. A BN *inference* layer is not needed as it can be absorbed by its preceding layer by scaling the weight values and modifying the bias as discussed in the *Enabling Lower Numerical Precision in the Frameworks* section.

Intel MKL-DNN implements the 8-bit convolution operations with the activation (or input) values in u8 format, the weights in s8 format and the biases in s32 format (biases can be kept in fp32 as well as they take a very small percentage of the overall compute). Figure 5 shows the process of inference operations with 8-bit multipliers accumulated to s32.



Figure 5: The data layer or the first convolution layer activations are quantized to u8 as inputs to the next convolutional layer. The weights are quantized to s8 and the bias is formatted to s32 and added to the s32 convolution accumulate. The framework chooses the format of the convolution output as s8, u8, or s32 depending on the parameters of the following layer. Image credit to Jiong Gong.

#### 8-bit quantization of activations with non-negative values and weights

Intel MKL-DNN currently assumes that the activations are non-negative, which is the case after the ReLU activation function. Later in this article we discuss how to quantize activations with negative values. Intel MKL-DNN quantizes the values for a given tensor or for each channel in a tensor (the choice is up to the framework developers) as follows.

 $R_{\{a,w\}} = \max(abs(\mathbb{T}_{\{a,w\}}))$ , where  $\mathbb{T}_{\{a,w\}}$  is a tensor corresponding to either the weights w or the activations or model inputs a.

 $Q_a = \frac{255}{R_a}$  is the quantization factor for activations with non-negative values, and  $Q_w = \frac{127}{R_w}$  is the quantization factor for the weights. The quantized activation, weights, and bias are:

$$a_{u8} = ||Q_a a_{f32}|| \in [0,255]$$
$$W_{s8} = ||Q_w W_{f32}|| \in [-127,127]$$
$$b_{s32} = ||Q_a Q_w b_{f32}|| \in [-2^{31}, 2^{31} - 2^{31}]$$

where the function  $\|\cdot\|$  rounds to the nearest integer. Note that while the *s*8 format supports - 128, the smallest quantized *s*8 weight value use is -127.

The affine transformation using 8-bit multipliers and 32-bit accumulates results in

$$\boldsymbol{x}_{s32} = \boldsymbol{W}_{s8}\boldsymbol{a}_{u8} + \boldsymbol{b}_{s32} \approx Q_a Q_w (\boldsymbol{W}_{f32}\boldsymbol{a}_{f32} + \boldsymbol{b}_{f32}) = Q_a Q_w \boldsymbol{x}_{f32}$$

1]

where the approximation is because the equation ignores the rounding operation, and

$$x_{f32} = W_{f32}a_{f32} + b_{f32} \approx \frac{1}{Q_a Q_w} x_{s32} = D x_{s32}$$

is the affine transformation with f32 format, and  $D = \frac{1}{Q_a Q_w}$  is the dequantization factor.

In quantizing to u8 and s8 formats, a zero value maps to a specific value without any rounding. Given that zero is one of the most common values, it is advantageous to have exact mappings to reduce quantization errors and improve statistical accuracy.

The quantization factors above can be in fp32 format in the Intel Xeon Scalable processors. However, some architectures do not support divides (e.g., FPGAs) and use shifts. For those architectures, the scalar is rounded to the nearest power-of-two and the scaling is done with bit-shifts. The reduction in statistical accuracy is minimal (usually <1%).

## Efficient 8-bit multiplies

In Figure 6, we demonstrate how to efficiently perform the 8-bit multiplies for  $A \times W$ . Intel MKL-DNN uses an *NHWC* data layout for the activation tensors where *N* is the batch size, *H* is the height, *W* is the width, and *C* is the number of channels, and an  $\left(\frac{O}{16}\right) K\left(\frac{C}{4}\right) T1604c$  data layout for the weight tensors where *O* is the number kernels or output channels, *C* is the number of input channels, K is the height, and T is the width. The first 32-bits (4 *int*8 values) of

tensor *A* shown in gray are broadcasted 16 times to fill a 512-bit register. Intel MKL-DNN modifies the data layout of tensor *W* after quantizing the weights. Tensor *W* data layout is rearranged as *W'* by groups of 16 columns, with each column having 32-bits (4 *int*8 values) to be read continuous in memory starting with the first 4 values in column 1 occupying the first 32-bits of the register (red), the next 4x1 occupying the next 32-bits of the register (orange), and so forth (green). The second, third, and fourth block (yellow) below the first block are rearranged in the same pattern. The next set of blocks (blue) follows. In practice, tensor *W* is usually transposed before re-arranging the memory layout in order to access 1x4 continuous memory values rather than 4x1 scatter values when rearranging the data layout. Modifying this data layout is usually done once and stored for reuse for all inference iterations.



Figure 6: Efficient use of *int*8 multiplies to compute the product  $A \times W$  requires a data layout transformation of tensor W in order to read continuous bits. Groups of 32-bits of A are broadcasted 16 times to fill a 512-bit register which are multiplied by groups of 512-bits from tensor W'.

The register with the first 4 *int*8 values (copied 16 times) of *A* is multiplied by the 64 *int*8 values (512-bits) of *W*' and accumulated. The next 4 values in *A* are broadcasted 16 times to another register which is multiplied by the next 64 *int*8 values of *W*'. This continues until the first row of *A* is read and the results are accumulated. The outputs (after all 3 instructions of the 8-bit FMA) are the first 16 output values (requiring 512-bits at *s*32). The first row of *A* is then multiplied by the next values of *W*' resulting in the next 16 values of the output.

The Intel Xeon Scalable processors have up to 32 registers. When executing in 512-bit register port scheme on processors with two FMA units<sup>3</sup>, Port 0 FMA has a latency of 4 cycles and Port 5 FMA has a latency of 6 cycles. The instructions used for deep learning workloads at *int*8 support bypass and have a latency of 5 cycles for both ports 0 and 5 (see Section 15.17). In practice, multiple rows of W' are loaded to multiple registers to hide these latencies.

# 16-bit functions for training

Intel MKL-DNN support for 16-bit multiplies using the AVX512\_VNNI instructions is designed as future work when the instructions become available. Nevertheless, researchers have already <u>shown</u> training of various CNNs models using 16-bit multiplies with 32-bit accumulates by taking advantage of the AVX512\_4VNNI instructions (also known as QVNNI, available on some Intel<sup>®</sup> Xeon<sup>®</sup> Phi<sup>™</sup> processors), specifically the AVX512\_4VNNI VP4DPWSSD instruction (similar to the AVX512\_VNNI VPDPWSSD instruction discussed earlier).

These researchers matched the fp32 statistical performance of ResNet-50, GoogLeNet-v1, VGG-16 and AlexNet with the same number of iterations as fp32 models without changing the hyper-parameters. They use s16 to store the activations, weights, and gradients, and also keep a master-copy of the fp32 weights for the weights updates that gets quantized back to s16 after each iteration. They use quantization factors that are powers-of-two which facilitates managing the quantization / dequantization factors through tensor multiplies.

# **Enabling Lower Numerical Precision in the Frameworks**

The popular frameworks enable users to define their model without writing all the function definitions themselves. The details on the implementations of the various functions can be hidden from the framework users. These implementations are done by framework developers. This section explains the modifications required at the framework level to enable lower numerical precision.

Quantizing the weights is done before inference starts. Quantizing the activations efficiently requires precomputing the quantization factors. The activation quantization factor are precomputed usually sampling the validation dataset to find the range as described above. Values in the test dataset outside this range are saturated to the range. For negative activation values, the range before saturation could be relaxed to  $-\frac{128R_{a'}}{127}$  in order to use the s8 = -128 value, where  $R_{a'}$  is maximum absolute value of these activations. These scalars are then written to a file.

<sup>&</sup>lt;sup>3</sup> Two 512-bit FMA units computing in parallel per core are available in Intel Xeon Platinum processors, Intel Xeon Gold processors 6000 series and 5122. Other Intel Xeon Scalable processor SKUs have one FMA unit per core.

#### 8-bit quantization of activations or inputs with negative values

Quantizing activations or input values with negative values can be implemented at the framework level as follows.  $Q_{a'} = \frac{127}{R_{a'}}$  is the quantization factor for activations with negative values. The *s*8 quantized format is  $a_{s8} = ||Q_{a'}a_{f32}|| \in [-128, 127]$ , where the function  $|| \cdot ||$  rounds to the nearest integer. However, the activation must be in *u*8 format to take advantage of the AVX512\_VNNI VPMADDUBSW instruction or the AVX512\_VNNI VPDPBUSD instruction (both described in the section *Lower numerical precision with Intel Xeon Scalable processors*). Therefore, all values in  $a_{s8}$  are shifted by K = 128 to be non-negative:

$$a_{u8} = a_{s8} + K\mathbf{1} \in [0, 255]$$

where 1 is a vector of all 1s, and the bias  $b_{f32}$  is modify as

$$b'_{f32} = b_{f32} - \frac{K}{Q'_a} W_{f32} \mathbf{1}$$

The methodology to quantize the weights and modified bias is the same as before:

$$W_{s8} = \|Q_w W_{f32}\| \in [-128, 127]$$
$$b'_{s32} = \|Q_{a}, Q_w b'_{f32}\| \in [-2^{31}, 2^{31} - 1]$$

The affine transformation using 8-bit multipliers and 32-bit accumulates results in

$$\boldsymbol{x}_{s32} = \boldsymbol{W}_{s8}\boldsymbol{a}_{u8} + \boldsymbol{b}'_{s32} \approx Q_w \boldsymbol{W}_{f32} (Q_{a'}\boldsymbol{a}_{f32} + K\mathbf{1}) + Q_w Q_{a'} (\boldsymbol{b}_{f32} - \frac{K}{Q'_a} \boldsymbol{W}_{f32}\mathbf{1}) = Q_{a'} Q_w (\boldsymbol{W}_{f32}\boldsymbol{a}_{f32} + \boldsymbol{b}_{32}) = Q_{a'} Q_w \boldsymbol{x}_{f32}$$

where

$$x_{f32} = W_{f32}a_{f32} + b_{32} \approx \frac{1}{Q_{a'}Q_{w}}x_{s32} = Dx_{s32}$$

where  $D = \frac{1}{Q_{a'}Q_w}$  is the dequantization factor.

When the input signal is already in u8 format (e.g., RGB images) but a preprocessing step is required to subtract the mean signal, the above equations can be used where K is the mean,  $a_{u8}$  is the input signal (not pre-processed), and  $Q_{ar} = 1$ .

Researchers often keep the first convolution layer in fp32 format and do the other convolutional layers in *int*8 (see *Brief History of Lower Numerical Precision in Deep Learning* section for examples). We observe that using these quantization techniques enables the use of all convolution layers in *int*8 with no significant decrease in statistical accuracy.

To recap, to use activations with negative values, the activations are quantized to s8 format and then shifted by K = 128 to u8 format. The only additional change is to modify the bias:  $b'_{f32} = b_{f32} - \frac{K}{Q'_a} W_{f32} \mathbf{1}$ . For a convolution layer the product  $W_{f32} \mathbf{1}$  is generalized to equal the sum over all the values of  $W_{f32}$  along all dimensions except the dimension shared with  $b_{f32}$ . See Appendix A for details.

#### **Fused quantization**

Fused quantization improves performance by combining dequantization and quantization as follows so there is no need to convert to fp32. The activation at layer l + 1 is:

$$\boldsymbol{a}_{f32}^{(l+1)} = g\left(\boldsymbol{x}_{f32}^{(l)}\right) = g\left(D^{(l)}\boldsymbol{x}_{s32}^{(l)}\right)$$

where  $g(\cdot)$  is a non-linear activation function. Assuming the ReLU activation function, the activation can be expressed in u8 format as

$$\boldsymbol{a}_{u8}^{(l+1)} = \left\| Q_a^{(l+1)} \boldsymbol{a}_{f32}^{(l+1)} \right\| = \left\| Q_a^{(l+1)} D^{(l)} \max\left(0, \boldsymbol{x}_{s32}^{(l)}\right) \right\|$$

where the product  $Q_a^{(l+1)}D^{(l)}$  enables computing the next layer's quantized activation in u8 format without computing the fp32 representation.

When  $g(\cdot)$  is the ReLU function (as in the equations below) and  $Q \ge 0$  (as is always the case for the quantization factors), the following property holds:

$$Qg\left(D^{(l)}\boldsymbol{x}_{s32}^{(l)} + D^{(h)}\boldsymbol{x}_{s32}^{(h)}\right) = g\left(QD^{(l)}\boldsymbol{x}_{s32}^{(l)} + QD^{(h)}\boldsymbol{x}_{s32}^{(h)}\right)$$

This property is useful for models with skip connections such as ResNet where a skip connection branch may have dependencies on various activations. As an example, and using the nomenclature by the ResNet-50 author in Caffe's <u>deploy.prototxt</u> (see Figure 7), the quantized input activation in layer  $res2b_branch2a$  (abbreviated as 2b2a in the equations below) is

$$\begin{aligned} \boldsymbol{a}_{u8}^{(2b2a)} &= Q_a^{(2b2a)} g\left( D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)} \right) \\ &= g\left( Q_a^{(2b2a)} D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + Q_a^{(2b2a)} D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)} \right) \end{aligned}$$

where  $a_{u8}^{(2b2a)} \in [0, 127]$  (instead of [0, 255]) because  $Q_a^{(2b2a)}D^{(2a1)}s_{32}^{(2a1)} \in [-128, 127]$  is in s8 format because the product comes before the ReLU function and  $Q_a^{(2b2a)} = \frac{127}{R_a^{(2b2a)}}$  is the

quantization factor. Following this procedure, it is shown in Appendix B that the activation  $a_{u8}^{(2c2a)}$  depends on  $s_{32}^{(2a1)}$ ,  $s_{32}^{(2a2c)}$  and  $s_{32}^{(2b2c)}$ . Similarly, the activation  $a_{u8}^{(3ca)}$  depends on  $s_{32}^{(2a1)}$ ,  $s_{32}^{(2a2c)}$ ,  $s_{32}^{(2a2c)}$ ,  $s_{32}^{(2b2c)}$  and  $s_{32}^{(2c2c)}$ .





## **Batch normalization**

A batch normalization (BN) *inference* layer is not needed as it can be absorbed by its preceding layer by scaling the weight values and modifying the bias. This technique only works for inference and is not unique to lower numerical precision. It can be implemented at the framework level instead of Intel MKL-DNN. BN is usually applied after the affine transformation x = Wa + b and before the activation function (details in the original BN paper). BN normalizes x to be zero mean and unit norm, and then scales and shifts the normalized vector by  $\gamma$  and  $\beta$ , respectively, which are parameters also learned during training. During a training iteration, x is normalized using the mini-batch statistics. For inference, the mean E and variance V of x are precomputed using the statistics of the entire training dataset

or a variant such as a running average of these statistics computed during training. During inference, the BN output y is:

$$\mathbf{y} = BN(\mathbf{x}) = \gamma \frac{\mathbf{x} - E\mathbf{1}}{V} + \beta \mathbf{1} = \gamma \frac{W\mathbf{a} + \mathbf{b} - E\mathbf{1}}{V} + \beta \mathbf{1} = \frac{\gamma}{V}W\mathbf{a} + \frac{\gamma}{V}\mathbf{b} + \frac{\beta - \gamma E}{V}\mathbf{1} = W'\mathbf{a} + \mathbf{b}'$$

where  $W' = \frac{\gamma}{V}W$  and  $b' = \frac{\gamma}{V}b + \frac{\beta - \gamma E}{V}$ **1**. That is, during inference the BN layer can be replaced by adjusting weights and bias in the preceding convolutional or fully connected layer.

## Frameworks

Intel enabled 8-bit inference in <u>Intel<sup>®</sup> Distribution of Caffe</u>\*. Intel's DL <u>Inference Engine</u>, Apache\* <u>MXNet</u>\*, and <u>TensorFlow</u>\* optimizations are expected to be available in Q2 2018. All these 8-bit optimizations are currently limited to CNN models. RNN models, 16-bit training enabling, and other frameworks will follow later in 2018.

In the Intel Distribution of Caffe, the *model.prototxt* file is modified to include the precomputed scalars as shown in Figure 8. Currently, the Intel Optimization of Caffe can provide the quantization factor as either a power-of-two or as regular fp32 value, and can use either 1 quantization factor per tensor or 1 per channel. Those quantization factors are computed using a sampling tool built into the Intel Distribution of Caffe.

```
layer {
  name: "conv2" type: "Convolution"
  ...
  quantization_param {
    precision: DYNAMIC_FIXED_POINT
    bw_layer_in: 8 // input bit-width
    bw_layer_out: 8 // output bit-width
    bw_params: 8 // weights bit-width
    fl_layer_in: 0 // input fraction length
    fl_layer_out: -2 // output fraction length
    fl_params: 8 // weights fraction length
    }
}
```

Figure 8: Quantization factors are added to the model.prototxt file. Image credit to Haihao Shen.

Intel's Deep Learning Inference Engine is part of Intel's Deep Learning Deployment Toolkit and Intel® Computer Vision SDK. It's available on Linux\* and Windows\* OS and initially supports models trained from the Caffe, MXNet, and TensorFlow frameworks. The Inference Engine facilitates deployment of DL solutions by delivering a unified API for various hardware backends: Intel Xeon processors with Intel AVX-2 and Intel AVX-512, Intel Atom® processors, Intel® HD Graphics, and Intel® Arria® 10 (Intel® A10) discrete cards at various numerical precisions depending on the hardware. The inference engine will support 8-bit inference on Intel Xeon Scalable processors starting in Q2 2018.

TensorFlow already supports 8-bit inference and various quantization <u>methods</u>. It can dynamically compute the scale or collect statistics during training or calibration phase to then assign a quantization factor. TensorFlow's graph, which includes these scalars, is written to a file. The graph with the respective scalars is quantized and ran during inference. TensorFlow supports two methods for quantization. One method is similar to Intel MKL-DNN by setting the min and max as additive inverses. The other uses arbitrary values for min and max that need an offset plus scale (not supported in Intel MKL-DNN). See Pete Warden's <u>blog</u> for more details but note that the blog is outdated as it does not contain all the ways to quantize in TensorFlow.

Another <u>tool</u> of TensorFlow is retraining or fine-tuning at lower numerical precision. Finetuning can improve the statistical performance. Given a model that is trained at fp32, after its weights are quantized, the model is then fine-tuned with the quantized weights and the weights are re-quantized after each training iteration.

<u>GemmLowP</u> is a Google library adopted in TensorFlow Lite\*. It uses u8 multiplies, where  $f32 = D \times (u8 - K)$ , K is an u8 value that maps to fp32 = 0, and D > 0 is the dequantization factor.

The Apache MXNet branch currently does not support 8-bit. However, a branch by one of the main MXNet contributors supports 8-bit inference. In that branch, there are two methods to quantize the values: one where the min value is mapped to 0 and the max value to 255 (note that zero does not map to an exact value); and, another one where the max of the absolute value is mapped to either -127 or 127 (note that zero maps to zero—similar to Intel MKL-DNN). The main difference with the presented approached is that the scalars in this MXNet branch are not precomputed. Rather, they are computed during the actual inference steps which reduces the benefits of lower numerical precision. In that branch, the scalars for the activations are computed by multiplying the scalars from the inputs with the scalars from the weights: activation-scalar = input-scalar \* quantized-weight; and activation = activation-scalar \* quantized-activation; input, weights, activations, and scalars are in *fp*32 format, quantized-input and quantized-weights are in *int*8 format, and quantized-activations are only dequantized when encountering an *fp*32 layers (e.g., softmax).

TensorRT <u>quantizes</u> to *s*8 format similar to Intel MKL-DNN with the addition of finding a tighter range by minimizing the KL divergence between the quantized and reference distributions.

The TPU team <u>claims</u> that TPUs which uses *int*8 multiplies are being used across a variety of models including LSTM models. The software stack <u>translates</u> API calls from TensorFlow graphs into TPU instructions.

Caffe2's docs <u>state</u> that there is "flexibility for future directions such as quantized computation," but currently no plans for quantization have been disclosed.

PyTorch has a <u>branch</u> that offers various options to quantize but there is no discussion on which is better.

Microsoft introduced <u>Project Brainwave</u><sup>\*</sup> using a custom 8-bit floating point format (ms-fp8) that runs on Intel<sup>®</sup> Stratix<sup>®</sup> 10 FPGAs. The details of this format, quantization techniques, or framework implementation has not been disclosed. Project Brainwave supports CNTK<sup>\*</sup> and TensorFlow and plans to support many others by converting models trained in popular frameworks to an internal graph-based intermediate representation.

## Model and graph optimizations

Model optimizations can further improve inference performance. For example, in ResNet, the stride operation can be moved to an earlier layer without modifying the end result and reducing the number of operations as shown in Figure 9. This modification applies to both 8-bit and 32-bits.



Figure 9: The stride 2 shown on the layers on the left blocks can be moved to an earlier layer during inference which reduces the number of operations and does not modify the result. Illustration courtesy of Eden Segal and Etay Meiri.

## Conclusion

Lower numerical precision inference and training can improve the computational performance with minimal or no reduction in statistical accuracy. Intel is enabling 8-bit precision for inference on the current generation of Intel Xeon Scalable processors. Intel is also enabling 8bit precision for inference and 16-bit precision for training on future microarchitectures in both hardware and software enabling compilers, the Intel MKL-DNN library and popular deep learning frameworks.

## Acknowledgements

A special thanks to the framework optimization team leads, and the Intel Xeon processor architects and researchers for the useful discussions including Israel Hirsh, Alex Heinecke, Vadim Pirogov, Frank Zhang, Rinat Rappoport, Barak Hurwitz, Dipankar Das, Dheevatsa Mudigere, Naveen Mellempudi, Dhiraj Kalamkar, Bob Valentine, AG Ramesh, Nagib Hakim as well as the wonderful reviewers R. Chase Adams, Nikhil Murthy, Banu Nagasundaram, Todd Wilson, Alexis Crowell, and Emily Hudson.

## **About the Authors**

**Andres Rodriguez**, PhD, is a Sr. Principal Engineer working with the Data Center Group (DCG) and Artificial Intelligence Products Group (AIPG) where he designs AI solutions for Intel's customers and provides technical leadership across Intel for AI products. He has 13 years of experience working in AI. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He holds over 20 peer reviewed publications in journals and conferences, and a book chapter on machine learning.

**Eden Segal**, is a software developer at the Pre-Enabling team where he optimizes Deep Learning algorithms to find the peak algorithm performance on Intel processors. This knowledge is used to improve Intel's performance across the entire deep learning stack from the hardware, through the libraries and up to the deep learning framework.

**Etay Meiri**, is a software developer at the Pre-Enabling team where he optimizes Deep Learning algorithms to find the peak algorithm performance on Intel processors. This knowledge is used to improve Intel's performance across the entire deep learning stack from the hardware, through the libraries and up to the deep learning framework.

**Evarist Fomenko**, MD in Applied Mathematics, is a software development engineer in Intel MKL and Intel MKL-DNN where he designs and optimizes library functions, and interacts with internal and external teams to assist with integration. He has 5 years of experience working on hardware optimizations at Intel.

**Young Jin Kim**, PhD, is a Sr. Machine Learning Engineer with Intel's AI Products Group (AIPG) where he develops and optimizes deep learning software frameworks for Intel's hardware architecture by adopting the state-of-the-art techniques. He has over 10 years of experience

working in artificial intelligence. Young received his PhD from Georgia Institute of Technology for his research in deep learning and high-performance computing. He holds over 10 peer reviewed publications in journals and conferences.

Haihao Shen, MD in Computer Science, is a deep learning engineer in machine learning and translation team (MLT) with Intel Software and Services Group (SSG). He leads the development of Intel Distribution of Caffe, including low precision inference and model optimizations. He has 6 years of experience working on software optimization and verification at Intel. Prior to joining Intel, he graduated from Shanghai Jiao Tong University.

**Barukh Ziv**, PhD, is a Senior Software Engineer, working with pre-Enabling group in SSGi, where he designs efficient implementations of DL applications for future generations of Xeon processors. He has 2 years of experience working on DL optimizations. Barukh received his Ph. D. in Technical Sciences from Kaunas University of Technology. He holds over 5 peer reviewed publications in journals and conferences.

#### Appendix A - Details on quantization of activations or inputs with negative values

To convince the reader that these same formulas (see the section 8-bit quantization of activations or inputs with negative values) generalize to convolutional layers, we use the indices of each tensor entry and work through the steps to show the convolutional output. Let  $W_{f32} \in \Re^{0 \times C \times K \times T}$  be the weight tensor with 0 kernels or output channels, C input channels, K height, and T width. The modified bias can be represented as:

$$b_{f32}'[o_i] = b_{f32}[o_i] - \frac{K}{Q_a'} \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} W_{f32}[o_i, c_i, \kappa_i, \tau_i] = b_{f32}[o_i] - \frac{K}{Q_a'} \overline{W}_{f32}[o_i]$$

where  $\overline{W}_{f32}[o_i] = \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} W_{f32}[o_i, c_i, \kappa_i, \tau_i]$  and  $o_i, c_i, \kappa_i$ , and  $\tau_i$  are the indices for the kernels or output channels, input channels, kernel height, and kernel width, respectively. The convolution output can be represented as follows. Note that we assume batch size one (to omit the batch index for simplicity), the activations have been already zero padded in fp32format (or equivalently padded with K = 128 in u8 format), and the convolution stride is one.

$$\begin{aligned} x_{s32}[o_i, h_i, w_i] &= b'_{s32}[o_i] + \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} a_{u8}[c_i, h_i + \kappa_i, w_i + \tau_i] W_{s8}[o_i, c_i, \kappa_i, \tau_i] \\ &\approx Q_{a'}Q_w b'_{f32}[o_i] + \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} (Q_{a'}a_{f32}[c_i, h_i + \kappa_i, w_i + \tau_i] + K) Q_w W_{f32}[o_i, c_i, \kappa_i, \tau_i] \\ &= Q_{a'}Q_w \left( b_{f32}[o_i] - \frac{K}{Q'_a} \overline{W}_{f32}[o_i] \right) + \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} Q_w Q_{a'}a_{f32}[c_i, h_i + \kappa_i, w_i + \tau_i] W_{f32}[o_i, c_i, \kappa_i, \tau_i] \\ &+ \sum_{c_i} \sum_{\kappa_i} \sum_{\tau_i} K Q_w W_{s8}[o_i, c_i, \kappa_i, \tau_i] \end{aligned}$$

$$= Q_{a'}Q_{w}b_{f32}[o_{i}] - KQ_{w}\overline{W}_{f32}[o_{i}] + \sum_{c_{i}}\sum_{\kappa_{i}}\sum_{\tau_{i}}Q_{w}Q_{a'}a_{f32}[c_{i},h_{i} + \kappa_{i},w_{i} + \tau_{i}]W_{f32}[o_{i},c_{i},\kappa_{i},\tau_{i}] + KQ_{w}\overline{W}_{f32}[o_{i}]$$

$$= Q_{a'}Q_{w}\left(b_{f32}[o_{i}] + \sum_{c_{i}}\sum_{\kappa_{i}}\sum_{\tau_{i}}a_{f32}[c_{i},h_{i} + \kappa_{i},w_{i} + \tau_{i}]W_{f32}[o_{i},c_{i},\kappa_{i},\tau_{i}]\right)$$

$$= Q_{a'}Q_{w}x_{f32}[o_{i},h_{i},w_{i}]$$

## Appendix B – Details on fused quantization with skip connections

The activation inputs to the layers marked by the blue arrow in Figure 7 are as follows where layer  $res2b_branch2a$  is abbreviated as 2b2a in the equations below with similar abbreviations for the other layers.

$$\begin{aligned} \boldsymbol{a}_{u8}^{(2b2a)} &= Q_a^{(2b2a)} \boldsymbol{a}_{f32}^{(2b2a)} \\ &\approx Q_a^{(2b2a)} g \left( D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)} \right) \\ &= g \left( Q_a^{(2b2a)} D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + Q_a^{(2b2a)} D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)} \right) \end{aligned}$$

$$\begin{aligned} \boldsymbol{a}_{u8}^{(2c2a)} &= Q_a^{(2c2a)} \boldsymbol{a}_{f32}^{(2c2a)} \\ &\approx Q_a^{(2c2a)} g\left(\boldsymbol{a}_{f32}^{(2b2a)} + D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)}\right) \\ &\approx Q_a^{(2c2a)} g\left(g\left(D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)}\right) + D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)}\right) \\ &= g\left(g\left(Q_a^{(2c2a)} D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + Q_a^{(2c2a)} D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)}\right) + Q_a^{(2c2a)} D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)}\right) \end{aligned}$$

$$\begin{aligned} \boldsymbol{a}_{u8}^{(3a2a)} &= Q_a^{(3a2a)} \boldsymbol{a}_{f32}^{(3a2a)} \\ &\approx Q_a^{(3a2a)} g\left(\boldsymbol{a}_{f32}^{(2c2a)} + D^{(2c2c)} \boldsymbol{s}_{32}^{(2c2c)}\right) \\ &\approx Q_a^{(3a2a)} g\left(g\left(g\left(D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)}\right) + D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)}\right) + D^{(2c2c)} \boldsymbol{s}_{32}^{(2c2c)}\right) \\ &= g\left(g\left(g\left(Q_a^{(3a2a)} D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + Q_a^{(3a2a)} D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)}\right) + Q_a^{(3a2a)} D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)}\right) \\ &+ Q_a^{(3a2a)} D^{(2c2c)} \boldsymbol{s}_{32}^{(2c2c)}\right) \end{aligned}$$

$$\begin{aligned} \boldsymbol{a}_{u8}^{(3a1)} &= Q_a^{(3a1)} \boldsymbol{a}_{f32}^{(3a1)} \\ &\approx Q_a^{(3a1)} g \left( \boldsymbol{a}_{f32}^{(2c2a)} + D^{(2c2c)} \boldsymbol{s}_{32}^{(2c2c)} \right) \\ &\approx Q_a^{(3a1)} g \left( g \left( g \left( D^{(2a1)} \boldsymbol{s}_{32}^{(2a1)} + D^{(2a2c)} \boldsymbol{s}_{32}^{(2a2c)} \right) + D^{(2b2c)} \boldsymbol{s}_{32}^{(2b2c)} \right) + D^{(2c2c)} \boldsymbol{s}_{32}^{(2c2c)} \right) \end{aligned}$$

$$= g \left( g \left( Q_a^{(3a1)} D^{(2a1)} s_{32}^{(2a1)} + Q_a^{(3a1)} D^{(2a2c)} s_{32}^{(2a2c)} \right) + Q_a^{(3a1)} D^{(2b2c)} s_{32}^{(2b2c)} \right) + Q_a^{(3a1)} D^{(2c2c)} s_{32}^{(2c2c)} \right)$$

#### **Notices and Disclaimers:**

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit <u>www.intel.com/benchmarks.</u>

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel, the Intel logo, Xeon, Xeon Phi and Intel Nervana are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others

© 2018 Intel Corporation. All rights reserved.