Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 3/31/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions

Functional Overview

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions extend Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) by promoting most of the 256-bit SIMD instructions with 512-bit numeric processing capabilities.

The Intel® AVX-512 instructions follow the same programming model as the Intel® AVX2 instructions, providing enhanced functionality for broadcast, embedded masking to enable predication, embedded floating point rounding control, embedded floating-point fault suppression, scatter instructions, high speed math instructions, and compact representation of large displacement values. Unlike Intel® SSE and Intel® AVX, which cannot be mixed without performance penalties, the mixing of Intel® AVX and Intel® AVX-512 instructions is supported without penalty.

Intel® AVX-512 intrinsics are supported on IA-32 and Intel® 64 architectures built from 32nm process technology. They map directly to the new Intel® AVX-512 instructions and other enhanced 128-bit and 256-bit SIMD instructions.


Intel® AVX-512 Registers

512-bit Register state is managed by the operating system using XSAVE / XRSTOR / XSAVEOPT instructions, introduced in 45nm Intel® 64 processors (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).

  • Support for sixteen new 512-bit SIMD registers in 64-bit mode (for a total of 32 SIMD registers, representing 2K of register space, ZMM0 through ZMM31).
  • Support for eight new opmask registers (k0 through k7) used for conditional execution and efficient merging of destination operands.

Intel® AVX registers YMM0-YMM15 map into Intel® AVX-512 registers ZMM0-ZMM15, very much like Intel® SSE registers map into Intel® AVX registers. In processors with Intel® AVX-512 support, Intel® AVX and Intel® AVX2 instructions operate on the lower 128- or 256-bits of the first sixteen ZMM registers.


Prefix Instruction Encoding Support for Intel® AVX-512

A new encoding prefix (referred to as EVEX) to support additional vector length encoding up to 512 bits. The EVEX prefix builds upon the foundations of VEX prefix, to provide compact, efficient encoding for functionality available to VEX encoding while enhancing vector capabilities.

The Intel® AVX-512 intrinsic functions use three C data types as operands, representing the new registers used as operands to the intrinsic functions. These are __m512, __m512d, and __m512i data types. The __m512 data type is used to represent the contents of the extended SSE register, the ZMM register, used by the Intel® AVX-512 intrinsics. The __m512 data type can hold sixteen 32-bit floating-point values. The __m512d data type can hold eight 64-bit double precision floating-point values. The __m512i data type can hold sixty-four 8-bit, thirty-two 16-bit, sixteen 32-bit, or eight 64-bit integer values.

The compiler aligns the __m512, __m512d, and __m512i local and global data to 64-byte boundaries on the stack. To align integer, float, or double arrays, use the __declspec(align) statement.


Data Types for Intel® AVX-512 Intrinsics

The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) intrinsics are located in the zmmintrin.h header file.

To use these intrinsics, include the immintrin.h file as follows:

#include <immintrin.h>

Intel® AVX-512 intrinsics have vector variants that use __m128, __m128i, __m128d, __m256, __m256i, __m256d, __m512, __m512i, and __m512d data types.

Naming and Usage Syntax

Most Intel® AVX-512 intrinsic names use the following notational convention:

_mm512[_<maskprefix>]_<intrin_op>_<suffix>

The following table explains each item in the syntax.

_mm512 Prefix representing the size of the largest vector in the operation considering any of the parameters or the result.
<maskprefix> When present, indicates write-masked (_mask) or zero-masked (_maskz) predication.
<intrin_op> Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.
<suffix> Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:
  • s: single-precision floating point

  • d: double-precision floating point

  • i512: signed 512-bit integer

  • i256: signed 256-bit integer

  • i128: signed 128-bit integer

  • i64: signed 64-bit integer

  • u64: unsigned 64-bit integer

  • i32: signed 32-bit integer

  • u32: unsigned 32-bit integer

  • i16: signed 16-bit integer

  • u16: unsigned 16-bit integer

  • i8: signed 8-bit integer

  • u8: unsigned 8-bit integer


Programs can pack eight double precision and sixteen single precision floating-point numbers within the 512-bit vectors, as well as eight 64-bit and sixteen 32-bit integers. This enables processing of twice the number of data elements that Intel® AVX or Intel® AVX2 can process with a single instruction and four times the capabilities of Intel® SSE.

Example: Write-Masking

Write-masking allows an intrinsic to perform its operation on selected SIMD elements of a source operand, with blending of the other elements from an additional SIMD operand. Consider the declarations below, where the write-mask k has a 1 in the even numbered bit positions 0, 3, 5, 7, 9, 11, 13 and 15, and a 0 in the odd numbered bit positions.

__m512 res, src, a, b;
__mmask16  k = 0x5555;

Then, given an intrinsic invocation such as this:

res = _mm512_mask_add_ps(src, k, a, b);

every even-numbered float32 element of the result res is computed as the sum of the corresponding elements in a and b, while every odd-numbered element is passed through (i.e., blended) from the corresponding float32 element in src.

Typical write-masked intrinsics are declared with a parameter order such that the values to be blended (src in the example above) are in the first parameter, and the write mask k immediately follows this parameter. Some intrinsics provide the blended values from a different SIMD parameter, for example: _mm512_mask2_permutex2var_epi32. In this case too, the mask will follow that parameter.

Example: Zero-Masking

Zero-masking is a simplified form of write-masking where there are no blended values. Instead result elements corresponding to zero bits in the write mask are simply set to zero. Given:

res = _mm512_maskz_add_ps(k, a, b);

the float32 elements of res corresponding to zeros in the write-mask k, are set to zero. The elements corresponding to ones in k, have the expected sum of corresponding elements in a and b.

Zero-masked intrinsics are typically declared with the write-mask as the first parameter, as there is no parameter for blended values.

Example: Embedded Rounding and Suppress All Exceptions (SAE)

Embedded rounding allows the floating point rounding mode to be explicitly specified for an individual operation, without having to modify the rounding controls in the MXCSR control register. The Suppress All Exceptions feature allows signaling of FP exceptions to be suppressed.

AVX-512 provides these capabilities on most 512-bit and scalar floating point operations. An intrinsic supporting these features will typically have "_round" in its name, for example:

__m512d _mm512_add_round_pd(__m512d a, __m512d b, int rounding);

To specify round-towards-zero and SAE, an invocation would appear as follows:

__m512d res, a, b;
res = _mm512_add_round_pd(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);

Example: Embedded Broadcasting

Embedded broadcasting allows a single value to be broadcast across a source operand, without requiring an extra instruction. The "set1" family of intrinsics represent a broadcast operation, and the compiler can embed such operations into the EVEX prefix of an AVX-512 instruction. For example,

__m512 res, a;
res = _mm512_add_ps(a, _mm512_set1_ps(3.0f));

will add 3.0 to each float32 element of a.