# Code Sample: Intel® AVX512-Deep Learning Boost: Intrinsic Functions

Published: 04/02/2019

Last Updated: 12/23/2019

Optimized for...

- Operating System: Linux*
- Hardware: Second generation Intel® Xeon® Scalable processor
- Software (programming language, tool, IDE, framework): Intel® C++ Compiler version 19, Intel® Parallel Studio XE 2019
- Prerequisites: Familiarity with C++

This code example shows how to take advantage of the new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with Intel® Deep Learning Boost (Intel® DL Boost) in 2nd generation Intel® Xeon® Scalable processors.

The example demonstrates testing the new functionality using intrinsic functions.

## Intel® AVX-512 and Intel® DL Boost

2nd generation Intel Xeon Scalable processors include a new Intel AVX-512 extension called Intel DL Boost, which contains the Vector Neural Network Instructions (VNNI). Designed to improve the throughput of integer linear algebra, these instructions can accelerate loops in some convolutional neural networks (CNNs) that multiply two 8-bit (or 16-bit) integers and accumulate the result in a 32-bit integer variable.
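For context, the kind of loop VNNI targets can be sketched in plain scalar C++. This is an illustrative toy (the function name `dot_int8` is not from the article): multiply 8-bit values pairwise and accumulate into a 32-bit integer, as in quantized CNN inner loops.

```cpp
#include <cassert>
#include <cstdint>

// Toy scalar model of the loop VNNI accelerates: unsigned 8-bit inputs
// times signed 8-bit weights, accumulated into a 32-bit integer.
int32_t dot_int8(const uint8_t* a, const int8_t* w, int n, int32_t acc) {
    for (int i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(w[i]);
    return acc;
}
```

VNNI performs 64 such 8-bit multiplies (grouped four per 32-bit accumulator lane) in a single instruction.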

The VNNI feature includes a fused instruction that performs lower-precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This instruction replaces a sequence of three instructions that are part of the Intel AVX-512 Fused-Multiply-Add (FMA) extensions. Figure 1 shows how the new VNNI instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD.

Figure 1. The Intel® AVX-512 DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD to perform 8-bit multiplies with 32-bit accumulates. Image credit to Israel Hirsh and Bob Valentine.

Find a detailed description of both the Intel® AVX-512 DL Boost fused instruction and the FMA-based instructions, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.

## Code Sample

This code sample uses Intel AVX-512 intrinsics to illustrate use of both the VNNI fused instruction and the three equivalent FMA-based instructions.

Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:

```cpp
#include <immintrin.h>
```

The Intel AVX-512 intrinsic functions use C data types as operands representing the 512-bit registers used in the operations. The __m512i data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values:

```cpp
uint8_t  op1_int8[64];
int8_t   op2_int8[64];
int32_t  op3_int[16];
int16_t  op4_int16[32];
int32_t  result[16];

__m512i v1_int8;
__m512i v2_int8;
__m512i v3_int;
__m512i v4_int16;
__m512i vresult;
```

Data from memory can be loaded into the registers using the _mm512_loadu_si512 function, which does not require the data to be aligned on any particular boundary. If the data is aligned on a 64-byte boundary, the _mm512_load_si512 function can be used instead:

```cpp
v1_int8 = _mm512_loadu_si512(op1_int8);
```


Once the data is loaded, perform the dot product operation using the fused instruction VPDPBUSDS, which is called via the intrinsic function _mm512_dpbusds_epi32. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in v1_int8 with the corresponding signed 8-bit integers in v2_int8, producing four intermediate signed 16-bit results. It then sums these four results with the corresponding 32-bit integer in v3_int using signed saturation, and returns the packed 32-bit results:

```cpp
// PERFORM THE DOT PRODUCT OPERATION USING THE FUSED INSTRUCTION
vresult = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);

_mm512_storeu_si512((void *) result, vresult);

printf("RESULTS USING FUSED INSTRUCTION: \n");
for (int j = 15; j >= 0; j--) {
    cout << result[j] << " ";
}
```
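As a sanity check, the per-lane arithmetic of the fused instruction can be modeled in plain scalar C++. This is an illustrative sketch (the function name `dpbusds_lane` is ours, not an Intel API): four unsigned-by-signed 8-bit products are summed into the 32-bit accumulator with signed saturation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit lane of _mm512_dpbusds_epi32:
// acc + u[0]*s[0] + u[1]*s[1] + u[2]*s[2] + u[3]*s[3],
// with the final sum saturated to the signed 32-bit range.
int32_t dpbusds_lane(const uint8_t u[4], const int8_t s[4], int32_t acc) {
    int64_t sum = acc;  // widen so saturation is easy to model
    for (int k = 0; k < 4; ++k)
        sum += static_cast<int64_t>(u[k]) * s[k];
    sum = std::min<int64_t>(sum, INT32_MAX);
    sum = std::max<int64_t>(sum, INT32_MIN);
    return static_cast<int32_t>(sum);
}
```

The vector instruction applies this computation to all 16 lanes of the 512-bit registers at once.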


The same computation can be expressed with the three FMA-based instructions named above, via the intrinsics _mm512_maddubs_epi16 (VPMADDUBSW), _mm512_madd_epi16 (VPMADDWD), and _mm512_add_epi32 (VPADDD):

```cpp
// PERFORM THE DOT PRODUCT OPERATION USING A SEQUENCE OF 3 INSTRUCTIONS

// Vertically multiply the 8-bit integers and horizontally add
// neighboring pairs into signed 16-bit results (with saturation)
__m512i vresult_int16 = _mm512_maddubs_epi16(v1_int8, v2_int8);

// Upconvert to 32-bit and horizontally add neighbors: multiply by 1
__m512i vone_int16 = _mm512_set1_epi16(1);
__m512i vresult_int32 = _mm512_madd_epi16(vresult_int16, vone_int16);

// Add the 32-bit accumulator
vresult = _mm512_add_epi32(vresult_int32, v3_int);

_mm512_storeu_si512((void *) result, vresult);

printf("RESULTS USING SEQUENCE OF 3 INSTRUCTIONS: \n");
for (int j = 15; j >= 0; j--)
    cout << result[j] << " ";
```
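For completeness, the three-instruction path can also be modeled in scalar C++. This sketch (the names `sat16` and `fma_sequence_lane` are ours) highlights a subtle difference from the fused instruction: VPMADDUBSW saturates each pairwise sum to 16 bits, so for large inputs the two paths can produce different results.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Saturate a 32-bit value to the signed 16-bit range, as VPMADDUBSW does.
int16_t sat16(int32_t x) {
    return static_cast<int16_t>(std::min(std::max(x, -32768), 32767));
}

// Scalar model of one 32-bit lane of the three-instruction sequence.
int32_t fma_sequence_lane(const uint8_t u[4], const int8_t s[4], int32_t acc) {
    // VPMADDUBSW: multiply and add adjacent pairs with 16-bit saturation
    int16_t t0 = sat16(u[0] * s[0] + u[1] * s[1]);
    int16_t t1 = sat16(u[2] * s[2] + u[3] * s[3]);
    // VPMADDWD with a vector of ones: widen to 32 bits and add neighbors
    int32_t pair = static_cast<int32_t>(t0) + static_cast<int32_t>(t1);
    // VPADDD: accumulate without saturation
    return acc + pair;
}
```

With small inputs both paths agree; with inputs near the 8-bit extremes (e.g. 255 × 127 twice in one pair), the intermediate 16-bit saturation clips the result, which is one reason the fused instruction is preferred beyond its throughput advantage.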