Introduction
immintrin debug is a new open-source library that implements the majority of modern x86 vector compiler intrinsics in C to enable source-level debugging. Its purpose is to make complex, intrinsic-heavy code (such as digital signal processing code) easier for the developer to debug.
An Example
Single Instruction, Multiple Data (SIMD) instructions introduced in the latest CPU generations have become more and more complex. For example, look at the description and pseudocode of this Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ternary logic intrinsic:
__m512i _mm512_maskz_ternarylogic_epi64 (__mmask8 k, __m512i a, __m512i b, __m512i c, int imm8)
Bitwise ternary logic that provides the capability to implement any three-operand binary function; the specific binary function is specified by value in imm8. For each bit in each packed 64-bit integer, the corresponding bit from a, b, and c are used to form a 3-bit index into imm8, and the value at that bit in imm8 is written to the corresponding bit in dst using zeromask k at 64-bit granularity (64-bit elements are zeroed-out when the corresponding mask bit is not set).
FOR j := 0 to 7
    i := j*64
    IF k[j]
        FOR h := 0 to 63
            index[2:0] := (a[i+h] << 2) OR (b[i+h] << 1) OR c[i+h]
            dst[i+h] := imm8[index[2:0]]
        ENDFOR
    ELSE
        dst[i+63:i] := 0
    FI
ENDFOR
dst[MAX:512] := 0
It is certainly fun to debug an application that contains this instruction. During source-level single-step debugging, a complex SIMD intrinsic appears as a single assembly instruction that takes several vectors and scalars as input and returns a new vector. The developer has to keep a mental model of the instruction while debugging, or use pen and paper to write down the bitmasks and vectors that do not fit in short-term memory.
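To show what that mental model has to cover, here is a minimal C sketch of the pseudocode above. It is my own illustration, not code taken from the library: the function names are hypothetical, and plain arrays of eight uint64_t values stand in for __m512i.

```c
#include <stdint.h>

/* One 64-bit lane of ternary logic: at every bit position, the bits of
   a, b and c form a 3-bit index that selects one bit of imm8.
   Hypothetical helper name, used only for this illustration. */
static uint64_t ternarylogic_lane64(uint64_t a, uint64_t b, uint64_t c,
                                    uint8_t imm8)
{
    uint64_t dst = 0;
    for (int h = 0; h < 64; h++) {
        unsigned index = (unsigned)((((a >> h) & 1u) << 2) |
                                    (((b >> h) & 1u) << 1) |
                                     ((c >> h) & 1u));
        dst |= (uint64_t)((imm8 >> index) & 1u) << h;
    }
    return dst;
}

/* Zero-masked form over eight lanes. Arrays of eight uint64_t stand in
   for __m512i; lanes whose mask bit in k is clear are written as zero. */
static void maskz_ternarylogic_epi64_sketch(uint64_t dst[8], uint8_t k,
                                            const uint64_t a[8],
                                            const uint64_t b[8],
                                            const uint64_t c[8],
                                            uint8_t imm8)
{
    for (int j = 0; j < 8; j++) {
        dst[j] = ((k >> j) & 1u)
                     ? ternarylogic_lane64(a[j], b[j], c[j], imm8)
                     : 0;
    }
}
```

With imm8 set to 0x96 each lane becomes a XOR b XOR c, and with 0xCA each bit of a selects between the corresponding bits of b and c. Being able to step through the inner loop and watch index and dst change is exactly what a single assembly instruction hides.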
Digital signal processing code, for example, Radio Access Network (RAN) L1, is very often implemented as sequences of Intel® Advanced Vector Extensions 2 (Intel® AVX2) or Intel AVX-512 intrinsics. Digital signal processing (DSP) code is especially hard to debug because one has to deal with both the complexity of the algorithm and the complexity of modern vector instructions.
There is a solution that assists with debugging code built from complex SIMD intrinsics: the immintrin debug library.
The immintrin debug Library
The immintrin debug library simplifies source-level debugging of x86 SIMD code by providing C versions of most Intel® Streaming SIMD Extensions 4 (Intel® SSE4), Intel® Advanced Vector Extensions (Intel® AVX), Intel AVX2, and Intel AVX-512 vector instructions. When running under the debugger, it is possible to single-step inside each intrinsic and examine vector content as the vectors are being processed. The library also enables debug tracing from within intrinsics, for those who prefer tracing over step-by-step debugging. About 90 percent of the vector functions from the Intel AVX-512, Intel AVX2, Intel AVX, and Intel SSE4 instruction sets are supported; the instructions that explicitly modify the contents of CPU flag registers are one example of what is not implemented. When an intrinsic is not implemented in C, the library falls back to the default path, which uses the actual CPU instruction.
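To give a feel for what a C version of an intrinsic with optional tracing can look like, here is a minimal sketch of a packed 32-bit add. Everything in it is an assumption made for illustration: the dbg_vec128i type, the dbg_add_epi32 name, and the dbg_trace_enabled switch are hypothetical and are not the identifiers or the mechanism that immintrin_dbg.h actually uses.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for __m128i: four 32-bit lanes that a debugger
   can display directly. */
typedef struct {
    int32_t lane[4];
} dbg_vec128i;

/* Hypothetical switch: set to nonzero to print every lane as it is computed. */
static int dbg_trace_enabled = 0;

/* Plain C model of a packed 32-bit integer add. Each lane wraps around
   on overflow, as the hardware instruction does. */
static dbg_vec128i dbg_add_epi32(dbg_vec128i a, dbg_vec128i b)
{
    dbg_vec128i dst;
    for (int j = 0; j < 4; j++) {
        /* Add as unsigned to avoid signed-overflow undefined behavior;
           converting back to int32_t wraps on mainstream compilers. */
        dst.lane[j] = (int32_t)((uint32_t)a.lane[j] + (uint32_t)b.lane[j]);
        if (dbg_trace_enabled) {
            fprintf(stderr, "add_epi32 lane %d: %d + %d = %d\n", j,
                    (int)a.lane[j], (int)b.lane[j], (int)dst.lane[j]);
        }
    }
    return dst;
}
```

A breakpoint inside the loop shows one lane at a time, and the trace line gives the same information for those who prefer a log.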
Immintrin debug is just a single C99 header file under a BSD license that is hosted on GitHub*. To use immintrin debug, you will need to modify your source code:
// Whatever your debug build macro name is
#ifdef _DEBUG_BUILD_
#include "immintrin_dbg.h"
#else
#include <immintrin.h>
#endif
This is the only modification you need to make in your source code. All calls to supported intrinsics are replaced automatically.
The ternary logic instruction description I used as an example is taken from the Intel® Intrinsics Guide. The Intrinsics Guide is an extremely useful website that contains MATLAB-like pseudocode for every intrinsic. The immintrin_dbg.h functions are generated automatically from these descriptions of the vector intrinsics, because manually implementing several thousand very similar functions would be error-prone.
Immintrin debug also makes it possible to run and debug x86 SIMD code on older platforms that do not support some of the newer vector instruction sets, though I recommend using Intel® Software Development Emulator (Intel® SDE) for this purpose. The library was tested with the Intel® C compiler, the GNU Compiler Collection* (GCC), and Clang*. GCC and Clang respect the -march flag, so when building for targets that do not have Intel AVX-512, they do not emit code that uses ZMM registers. The Intel C compiler, however, uses ZMM0 to return 512-bit vectors from functions regardless of the build target, so a binary built with an Intel compiler works only on hardware that supports the advanced vector instruction sets. Please note that when you compile with advanced optimizations enabled, for example -O3, most compilers vectorize the code, and single-step debugging becomes very difficult. With GCC, for example, debug builds are best done with the -Og -g options.
Library Test Process and Results
The C implementations of the intrinsics were tested with automatically generated unit tests, which compare the result of every debug intrinsic against the actual instruction on random data. Most functions passed these tests, but some floating-point operations, such as reciprocal square root, behave slightly differently in the debug library than the actual instructions do: the results are very close but sometimes not identical. I cannot promise 100 percent reproducibility of float and double computations. Floating-point reproducibility is a complex field in itself, so please assume the worst case: for now, treat the library's floating-point behavior as equivalent to a fast-math model rather than strict IEEE 754.
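The idea of comparing two implementations on random data with a tolerance, rather than bit for bit, can be sketched in a few lines of C. This is an illustration under my own assumptions, not the generated test suite: the names are hypothetical, a scalar reciprocal stands in for the vector reciprocal square root, and a Newton-Raphson approximation plays the role of the implementation that is close but not identical.

```c
#include <stddef.h>
#include <stdint.h>

typedef float (*unary_fn)(float);

/* True when a and b differ by at most rel_tol times the larger magnitude.
   NaN inputs always compare as different. */
static int nearly_equal(float a, float b, float rel_tol)
{
    float diff  = a > b ? a - b : b - a;
    float mag_a = a < 0.0f ? -a : a;
    float mag_b = b < 0.0f ? -b : b;
    float scale = mag_a > mag_b ? mag_a : mag_b;
    return diff <= rel_tol * scale;
}

/* Small xorshift generator so that the random inputs are reproducible. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Reference implementation: plain division. */
static float recip_exact(float x)
{
    return 1.0f / x;
}

/* Approximate implementation for x in [1, 2): a linear initial guess
   refined by three Newton-Raphson steps. */
static float recip_newton(float x)
{
    float d = 0.5f * x;                            /* scale into [0.5, 1) */
    float y = 48.0f / 17.0f - (32.0f / 17.0f) * d; /* guess for 1/d */
    for (int i = 0; i < 3; i++) {
        y = y * (2.0f - d * y);
    }
    return 0.5f * y;                               /* 1/x = (1/d) / 2 */
}

/* Run both implementations on random inputs in [1, 2) and count the
   inputs where they disagree beyond the relative tolerance. */
static size_t count_mismatches(unary_fn ref, unary_fn candidate,
                               size_t trials, uint32_t seed, float rel_tol)
{
    uint32_t state = seed ? seed : 1u;  /* xorshift state must be nonzero */
    size_t bad = 0;
    for (size_t i = 0; i < trials; i++) {
        /* 23 random bits map exactly onto the floats in [1, 2). */
        float x = 1.0f + (float)(xorshift32(&state) >> 9) / 8388608.0f;
        if (!nearly_equal(ref(x), candidate(x), rel_tol)) {
            bad++;
        }
    }
    return bad;
}
```

Calling count_mismatches(recip_exact, recip_newton, 1000, 42u, 1e-5f) checks the two versions against a loose tolerance. Tightening rel_tol toward zero is a quick way to see how far an approximate implementation drifts from the reference.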