Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 7/13/2023

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Additional Instructions

Additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions

These additional instructions can be divided into two groups:

  • Byte and Word Instructions (BWI)
  • Doubleword and Quadword Instructions (DQI)

The BWI (8- and 16-bit) operations, indicated by the AVX512BW and AVX512VBMI CPUID flags, enhance small-integer operations. The DQI (32- and 64-bit) operations, indicated by the AVX512DQ and AVX512IFMA52 CPUID flags, enhance integer and floating-point operations.

An additional, orthogonal capability known as Vector Length Extensions (VLE) allows AVX-512 instructions to operate on 128 or 256 bits instead of only 512. VLE can be applied to the foundation instructions, the Conflict Detection Instructions (CDI), BWI, and DQI, and is indicated by the AVX512VL CPUID flag. VLE extends most AVX-512 operations to XMM (128-bit, SSE) and YMM (256-bit, AVX) registers, so that the capabilities of EVEX encodings, including mask registers and access to registers 16..31, apply to XMM and YMM registers instead of only to ZMM registers.
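The CPUID flags named above can be queried at run time before an AVX-512 code path is selected. The following is a minimal sketch, assuming an x86 target and a GCC- or Clang-compatible compiler that provides the __builtin_cpu_supports built-in. The HAS_* bit names and the avx512_feature_bits and print_avx512_features functions are hypothetical helpers for this illustration, not part of any Intel API.

```c
#include <stdio.h>

/* Bit positions used by this sketch only (hypothetical, not an Intel API). */
enum {
    HAS_AVX512F    = 1 << 0,  /* foundation instructions */
    HAS_AVX512CD   = 1 << 1,  /* conflict detection (CDI) */
    HAS_AVX512BW   = 1 << 2,  /* byte and word (BWI) */
    HAS_AVX512DQ   = 1 << 3,  /* doubleword and quadword (DQI) */
    HAS_AVX512VL   = 1 << 4,  /* vector length extensions (VLE) */
    HAS_AVX512VBMI = 1 << 5,
    HAS_AVX512IFMA = 1 << 6
};

/* Collect the AVX-512 feature flags reported for the running CPU. */
unsigned avx512_feature_bits(void)
{
    unsigned bits = 0;
    __builtin_cpu_init();  /* harmless if the runtime already initialized it */
    if (__builtin_cpu_supports("avx512f"))    bits |= HAS_AVX512F;
    if (__builtin_cpu_supports("avx512cd"))   bits |= HAS_AVX512CD;
    if (__builtin_cpu_supports("avx512bw"))   bits |= HAS_AVX512BW;
    if (__builtin_cpu_supports("avx512dq"))   bits |= HAS_AVX512DQ;
    if (__builtin_cpu_supports("avx512vl"))   bits |= HAS_AVX512VL;
    if (__builtin_cpu_supports("avx512vbmi")) bits |= HAS_AVX512VBMI;
    if (__builtin_cpu_supports("avx512ifma")) bits |= HAS_AVX512IFMA;
    return bits;
}

/* Print one line per flag: 1 if reported as supported, 0 otherwise. */
void print_avx512_features(void)
{
    unsigned bits = avx512_feature_bits();
    printf("AVX512F:    %d\n", (bits & HAS_AVX512F)    != 0);
    printf("AVX512CD:   %d\n", (bits & HAS_AVX512CD)   != 0);
    printf("AVX512BW:   %d\n", (bits & HAS_AVX512BW)   != 0);
    printf("AVX512DQ:   %d\n", (bits & HAS_AVX512DQ)   != 0);
    printf("AVX512VL:   %d\n", (bits & HAS_AVX512VL)   != 0);
    printf("AVX512VBMI: %d\n", (bits & HAS_AVX512VBMI) != 0);
    printf("AVX512IFMA: %d\n", (bits & HAS_AVX512IFMA) != 0);
}
```

A program would typically call avx512_feature_bits once at startup and use the result to choose between an AVX-512 code path and a fallback.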

BWI

The BWI, indicated by the AVX512BW CPUID flag, extend write-masking and zero-masking to smaller element sizes. The original AVX-512 Foundation instructions supported such masking only for vector element sizes of 32 or 64 bits. Because a 512-bit vector register holds at most sixteen 32-bit elements, a 16-bit write mask was sufficient.

With the AVX512BW instructions, a 512-bit vector can hold sixty-four 8-bit elements or thirty-two 16-bit elements, so write masks must be able to hold up to 64 bits. To support this, two new mask types, __mmask32 and __mmask64, have been introduced, along with additional maskable intrinsics that operate on vectors of 8- and 16-bit elements. For example:

__m512i _mm512_mask_abs_epi8(__m512i src, __mmask64 k, __m512i a);

This intrinsic computes the absolute value of each 8-bit element of a whose corresponding bit in write mask k is set. Elements whose corresponding bit in k is zero are copied from src.
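The write-masking behavior of _mm512_mask_abs_epi8 can be exercised with a small wrapper. The following is a sketch, assuming an x86 target and a GCC- or Clang-compatible compiler (it relies on the target function attribute and __builtin_cpu_supports). The function names mask_abs_avx512bw, mask_abs_scalar, and mask_abs_epi8_64 are hypothetical, and the scalar fallback is added only so that the sketch also runs on processors without AVX512BW.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX-512 path: the target attribute enables AVX512BW for this function only. */
__attribute__((target("avx512f,avx512bw")))
static void mask_abs_avx512bw(int8_t dst[64], const int8_t src[64],
                              uint64_t k, const int8_t a[64])
{
    __m512i vsrc = _mm512_loadu_si512(src);
    __m512i va   = _mm512_loadu_si512(a);
    /* One mask bit per 8-bit element, hence the 64-bit mask type. */
    __m512i r    = _mm512_mask_abs_epi8(vsrc, (__mmask64)k, va);
    _mm512_storeu_si512(dst, r);
}

/* Scalar fallback with the same element-wise meaning. */
static void mask_abs_scalar(int8_t dst[64], const int8_t src[64],
                            uint64_t k, const int8_t a[64])
{
    for (int i = 0; i < 64; ++i) {
        /* As with the instruction, abs(-128) wraps back to -128 in 8 bits. */
        int8_t abs_v = (int8_t)(a[i] < 0 ? -a[i] : a[i]);
        dst[i] = ((k >> i) & 1) ? abs_v : src[i];
    }
}

/* Select the AVX512BW path when the running CPU reports the flag. */
void mask_abs_epi8_64(int8_t dst[64], const int8_t src[64],
                      uint64_t k, const int8_t a[64])
{
    if (__builtin_cpu_supports("avx512bw"))
        mask_abs_avx512bw(dst, src, k, a);
    else
        mask_abs_scalar(dst, src, k, a);
}
```

With k set to 0x5, for instance, only elements 0 and 2 receive absolute values from a; every other element of dst is taken from src.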

DQI

The DQI, indicated by the AVX512DQ CPUID flag, consist of additional instructions similar to the foundation instructions indicated by the AVX512F CPUID flag. They operate on 512-bit vectors that hold sixteen 32-bit elements or eight 64-bit elements. Some of these instructions provide new functionality, such as the conversion of floating-point numbers to 64-bit integers. Others promote existing instructions (for example, vxorps) to use 512-bit registers.
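As one illustration of the floating-point to 64-bit integer conversions mentioned above, the sketch below converts eight doubles with the AVX512DQ intrinsic _mm512_cvttpd_epi64, which truncates toward zero. It assumes an x86 target and a GCC- or Clang-compatible compiler. The names trunc8_avx512dq, trunc8_scalar, and trunc8_pd_to_epi64 are hypothetical, and the scalar fallback exists only so that the sketch runs on processors without AVX512DQ.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX-512 path: the target attribute enables AVX512DQ for this function only. */
__attribute__((target("avx512f,avx512dq")))
static void trunc8_avx512dq(int64_t dst[8], const double src[8])
{
    __m512d v = _mm512_loadu_pd(src);
    __m512i r = _mm512_cvttpd_epi64(v);  /* eight doubles to eight int64, truncating */
    _mm512_storeu_si512(dst, r);
}

/* Scalar fallback: a C cast also truncates toward zero for in-range values. */
static void trunc8_scalar(int64_t dst[8], const double src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int64_t)src[i];
}

/* Select the AVX512DQ path when the running CPU reports the flag. */
void trunc8_pd_to_epi64(int64_t dst[8], const double src[8])
{
    if (__builtin_cpu_supports("avx512dq"))
        trunc8_avx512dq(dst, src);
    else
        trunc8_scalar(dst, src);
}
```

The two paths agree only for inputs that fit in a signed 64-bit integer; out-of-range values and NaN are outside the scope of this sketch.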

VLE

The VLE (indicated by the AVX512VL CPUID flag) add write-masking, zero-masking, and embedded broadcast features to 128- and 256-bit vector lengths. For example:

__m256 _mm256_maskz_add_ps(__mmask8 k, __m256 a, __m256 b);

This intrinsic adds corresponding float32 elements of a and b where the mask bit from k is set, and produces zero in the elements where the bit from k is clear.
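The zero-masking behavior of _mm256_maskz_add_ps can be tried with a small wrapper. This is a sketch, assuming an x86 target and a GCC- or Clang-compatible compiler. The names maskz_add8_avx512vl, maskz_add8_scalar, and maskz_add8 are hypothetical, and the scalar fallback is included only so that the sketch runs on processors without AVX512VL.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX-512 path: the target attribute enables AVX512VL for this function only. */
__attribute__((target("avx512f,avx512vl")))
static void maskz_add8_avx512vl(float dst[8], uint8_t k,
                                const float a[8], const float b[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* Eight float32 elements in a YMM register, so an 8-bit mask suffices. */
    __m256 r  = _mm256_maskz_add_ps((__mmask8)k, va, vb);
    _mm256_storeu_ps(dst, r);
}

/* Scalar fallback with the same element-wise meaning. */
static void maskz_add8_scalar(float dst[8], uint8_t k,
                              const float a[8], const float b[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : 0.0f;
}

/* Select the AVX512VL path when the running CPU reports the flag. */
void maskz_add8(float dst[8], uint8_t k, const float a[8], const float b[8])
{
    if (__builtin_cpu_supports("avx512vl"))
        maskz_add8_avx512vl(dst, k, a, b);
    else
        maskz_add8_scalar(dst, k, a, b);
}
```

With k set to 0x0F, the lower four elements of dst hold the sums and the upper four are zero.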