Introduction
Intel® Xeon® Scalable processors support the increasing demands in performance with Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which is a set of new instructions that can accelerate performance for demanding computational workloads.
The full specification of the Intel® AVX-512 instruction set consists of several separate subsets:
- Foundation Instructions
- Conflict Detection Instructions (CDI)
- Byte (char or int8) and Word (short or int16) Instructions
- Double-word (int32 or int) and Quad-word (int64 or long) Instructions
- Vector Length Extensions (VLE)
A more detailed description of these subsets can be found in the articles Improve Performance Using Vectorization and Intel® Xeon® Scalable Processors and "AVX-512 May Be a Hidden Gem" in Intel Xeon Scalable Processors.
Intel AVX-512 may increase the vectorization efficiency of our code, both on current hardware and on future generations of parallel hardware. There are three reasons: the new instructions operate on 512-bit registers, there are 32 of these vector registers, and Intel AVX-512 adds new features that benefit vectorization. In Intel Xeon Scalable processors, these features can also operate on registers of different sizes, which makes them available to a larger number of applications. This functionality is provided by an additional orthogonal capability, the vector length extensions.
What are Vector Length Extensions?
The Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set offers 256-bit vector support. Compared to Intel AVX2, Intel AVX-512 doubles the number of vector registers (from 16 to 32), and each vector register can pack twice as many single-precision or double-precision floating-point numbers. This means more work can be achieved per CPU cycle, because the registers can hold more data to be processed simultaneously.
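The register arithmetic above can be checked with a few lines of C++. This is only a minimal sketch: the helper name lanes is hypothetical, and it assumes a 32-bit float and a 64-bit double, as on x86-64.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements ("lanes") of type T that fit
// in a vector register that is RegisterBits wide.
template <typename T, std::size_t RegisterBits>
constexpr std::size_t lanes() {
    return RegisterBits / (8 * sizeof(T));
}

// Assumption: float is 32 bits and double is 64 bits (true on x86-64).
static_assert(lanes<float, 256>() == 8, "256-bit YMM register: 8 floats");
static_assert(lanes<float, 512>() == 16, "512-bit ZMM register: 16 floats");
static_assert(lanes<double, 256>() == 4, "256-bit YMM register: 4 doubles");
static_assert(lanes<double, 512>() == 8, "512-bit ZMM register: 8 doubles");
```

The same arithmetic explains the "vector length" values that the compiler reports for 32-bit elements: 16 for ZMM registers and 8 for YMM registers.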
However, not all applications benefit from the extended 512-bit registers (applications with few vectorized loops, or with loops that have low trip counts, for example). For these cases, the VLE orthogonal feature allows applications to take advantage of most Intel AVX-512 instructions on shorter vector lengths: 128-bit (XMM registers) and 256-bit (YMM registers), in addition to 512-bit (ZMM registers). These Intel AVX-512 instructions, while running on different vector lengths, can still take advantage of the larger number of registers per core (32) and of the opmask registers (8).
Using the new functionality in VLE, applications have more options for optimization, because most Intel AVX-512 instructions can be used while using different vector lengths as constrained by the algorithm or data.
Example
To demonstrate the VLE orthogonal feature, we use an example code that computes the histogram of an image. The description of the algorithm and code is available in the Improve Vectorization Performance with Intel® AVX-512 tutorial, where the sample code can also be downloaded. This code example is used only for demonstration purposes.
This code shows an example of loops that are not vectorized by the Intel® C++ Compiler using the Intel AVX2 instruction set architecture (ISA), due to data dependencies caused by indirect referencing in the array computing the histogram. However, when using the Intel AVX-512 ISA, the compiler is able to vectorize these loops using instructions from the CDI subset.
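To make the dependency concrete, here is a minimal sketch of a histogram loop with indirect referencing. It is not the tutorial's code: the function name grayscale_histogram and the use of 8-bit pixels in a std::vector are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the tutorial's code): count how often each
// 8-bit grayscale value occurs in `pixels`.
std::vector<int> grayscale_histogram(const std::vector<std::uint8_t>& pixels) {
    std::vector<int> hist(256, 0);
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        // Indirect reference: the bin index is read from memory.
        // If two iterations in the same vector hit the same bin, their
        // increments depend on each other, so a plain gather/add/scatter
        // sequence would lose counts.
        hist[pixels[i]]++;
    }
    return hist;
}
```

Because the compiler cannot know in advance whether neighboring pixels share a grayscale value, it has to assume that they might, which is the vector dependence that blocks vectorization without conflict detection.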
The CDI subset includes functionality to detect data conflicts in vector registers and to store this information in mask vectors, which are then used in the vector computations. Here, a conflict means that two elements of the same vector hold identical grayscale values and would therefore update the same histogram bin. As explained in the tutorial, the result is that only the elements without conflicts are processed simultaneously.
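The following scalar sketch models what conflict detection computes for one vector of 32-bit indices: for each element, a bit vector marking the earlier elements that hold the same value. The function name conflict_masks is hypothetical, and this is an illustration of the idea, not the code the compiler generates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of conflict detection on one vector:
// bit j of result[i] is set when j < i and idx[j] == idx[i].
std::vector<std::uint32_t> conflict_masks(const std::vector<std::uint32_t>& idx) {
    assert(idx.size() <= 32);  // one result bit per lane, 32-bit results
    std::vector<std::uint32_t> masks(idx.size(), 0u);
    for (std::size_t i = 0; i < idx.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            if (idx[j] == idx[i]) {
                masks[i] |= (std::uint32_t{1} << j);
            }
        }
    }
    return masks;
}
```

An element whose result is zero has no earlier duplicate in the vector, so its histogram bin can be updated together with the other conflict-free elements. Elements with a nonzero result have to wait until the earlier duplicates have been processed.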
In the next three experiments, this sample code will be compiled using the Intel C++ Compiler, using three different sets of options. The last experiment shows that the CDI functionality is still used, even when we direct the compiler to use YMM (256-bit) registers, instead of ZMM (512-bit) registers.
Experiment 1. Let us first compile the example code using the Intel AVX2 flag:
icpc Histogram_Example.cpp -O3 -restrict -qopt-report -qopt-report-file=runAVX2.optrpt -xCORE-AVX2 -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX2
Notice that we have compiled the code using the Intel C++ Compiler option -qopt-report to generate an optimization report. Here is a section of the optimization report showing that the loop computing the filter and histogram (the loop at line 107) has not been vectorized:
LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
LOOP END
Experiment 2. However, when we compile the code for Intel AVX-512 and add the Intel C++ Compiler option -qopt-zmm-usage=high to use ZMM (512-bit) registers, the optimization report shows that the loop, as expected, has been vectorized:
icpc Histogram_Example.cpp -O3 -restrict -qopt-report -qopt-report-file=runAVX512.optrpt -xCORE-AVX512 -qopt-zmm-usage=high -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX512
LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15300: LOOP WAS VECTORIZED
LOOP END
The option -qopt-zmm-usage=high/low used above is a new option that has been added to Intel® compilers (starting with version 18.0) to enable more flexible use of single instruction, multiple data (SIMD) vectorization on the Intel Xeon processor Scalable family. This new option should be used on top of the -xCORE-AVX512 option, as shown above, and may also be combined with the -qopt-report option, which asks the compiler to generate an optimization report that helps developers understand compiler-generated optimizations and look for more optimization opportunities. More information about these new features in Intel compilers can be found in Tuning SIMD vectorization when targeting Intel Xeon Processor Scalable Family.
By changing the option -qopt-report to -qopt-report=5, we ask the compiler to generate a more detailed vectorization report. In particular, we can see in the report that the compiler has generated code that uses ZMM registers to store 16 floats.
LOOP BEGIN at Histogram_Example.cpp(107,5)
(…)
remark #15416: vectorization support: irregularly indexed store was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory [ Histogram_Example.cpp(122,8) ]
remark #15415: vectorization support: irregularly indexed load was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory [ Histogram_Example.cpp(122,8) ]
remark #15305: vectorization support: vector length 16
remark #15300: LOOP WAS VECTORIZED
(…)
LOOP END
Furthermore, taking a look at the assembly code (generated by the Intel C++ Compiler by using the -S option), we notice that the compiler has vectorized the computation of the histogram in the loop (line 122 in the source code):
hist2[ int(image2[position]) ]++;
by using the conflict detection instructions from the CDI subset on ZMM registers:
vpbroadcastmw2d %k2, %zmm6 #122.8
vpconflictd %zmm4, %zmm2{%k2}{z} #122.8
vpandd %zmm6, %zmm2, %zmm5 #122.8
The above example shows the expected result of compiling this code with the combination of compiler options that make the compiler use ZMM registers: -xCORE-AVX512 -qopt-zmm-usage=high.
Experiment 3. In contrast, the combination of options -xCORE-AVX512 -qopt-zmm-usage=low (NOTE: -qopt-zmm-usage=low is the default for -xCORE-AVX512) tells the compiler that the program is unlikely to benefit from using ZMM registers, so the compiler will most likely use shorter registers:
icpc Histogram_Example.cpp -O3 -restrict -qopt-report -qopt-report-file=runAVX512.optrpt -xCORE-AVX512 -qopt-zmm-usage=low -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX512
LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15300: LOOP WAS VECTORIZED
remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
LOOP END
Alternatively, when we change the option -qopt-report to -qopt-report=5, the compiler gives more details:
LOOP BEGIN at Histogram_Example.cpp(107,5)
(…)
remark #15416: vectorization support: irregularly indexed store was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory [ Histogram_Example.cpp(122,8) ]
remark #15415: vectorization support: irregularly indexed load was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory [ Histogram_Example.cpp(122,8) ]
remark #15305: vectorization support: vector length 8
remark #15300: LOOP WAS VECTORIZED
(…)
LOOP END
which shows that this time the compiler is generating code that uses YMM (256-bit) registers. However, by looking at the assembly code, we notice that this loop is still being vectorized using the conflict detection instructions, but now operating on YMM registers:
vpbroadcastmw2d %k2, %ymm5 #122.8
vpconflictd %ymm3, %ymm2{%k2}{z} #122.8
vpand %ymm5, %ymm2, %ymm4 #122.8
which is the result of the VLE orthogonality being used. The VLE orthogonal feature allows this application to take advantage of the CDI on shorter vectors (256-bit YMM registers in this case), which is not possible by just using Intel AVX2 instructions, as shown in Experiment 1, above. Again, even while using YMM registers, this application still can take advantage of the larger number of registers per core (32) and opmask registers (8) present on Intel Xeon Scalable processors.
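For readers who prefer intrinsics to compiler-generated assembly, the sketch below applies conflict detection to 8 indices in a single 256-bit register. It uses the _mm256_conflict_epi32 intrinsic, which requires both the CDI and VLE subsets, and it is only compiled in when the compiler defines the __AVX512CD__ and __AVX512VL__ macros (we assume this is the case under a flag such as -xCORE-AVX512). The helper name conflict8 and the scalar fallback are hypothetical additions for illustration.

```cpp
#include <array>
#include <cstdint>

#if defined(__AVX512CD__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

// Hypothetical helper: for 8 32-bit indices (one YMM register), bit j of
// result[i] is set when j < i and idx[j] == idx[i].
std::array<std::uint32_t, 8> conflict8(const std::array<std::uint32_t, 8>& idx) {
    std::array<std::uint32_t, 8> out{};
#if defined(__AVX512CD__) && defined(__AVX512VL__)
    // Conflict detection (CDI) on a 256-bit vector, made possible by VLE.
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx.data()));
    __m256i c = _mm256_conflict_epi32(v);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), c);
#else
    // Portable scalar fallback that computes the same bit vectors.
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < i; ++j) {
            if (idx[j] == idx[i]) {
                out[i] |= (std::uint32_t{1} << j);
            }
        }
    }
#endif
    return out;
}
```

Both paths return the same values, so the function can be built with or without Intel AVX-512 support. The point of the intrinsic path is that the conflict detection instruction itself runs on a YMM register, which Intel AVX2 alone does not provide.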
Conclusion
Intel AVX-512 is available in Intel Xeon Scalable processors. This new instruction set can accelerate performance for several workloads and usages because it offers enhanced vector processing capabilities, such as a larger number of registers per core, as well as vector operations that can operate on wider 512-bit registers.
To make the new features in Intel AVX-512 available to a larger number of applications, the new VLE orthogonal feature lets applications use most of the Intel AVX-512 instructions on shorter vector lengths, while still taking advantage of the larger number of registers. This new feature benefits applications that naturally perform SIMD operations on 128-bit or 256-bit registers.
The VLE orthogonal feature was demonstrated here using a code sample that shows a clear benefit from Intel AVX-512 instructions. Specifically, the sample benefits from automatic vectorization using CDI. VLE potentially broadens the range of applications that can benefit in this way by letting them operate on shorter registers (XMM or YMM), while still taking advantage of the conflict detection instructions for vectorization, as well as the larger number of registers.