What’s New in LLVM* for 4th Gen Intel® Xeon® Scalable Processors & Intel® Max Processors

By @IntelDevTools

Story at a Glance

  • On January 12, 2023, LLVM* 15.0.7 was released, continuing Intel’s long history of working with the LLVM community to contribute innovative and performance-enhancing optimizations to the LLVM open source project.
  • The LLVM optimizations target the newly launched 4th Gen Intel® Xeon® Scalable Processors and Intel® Xeon® CPU Max Series (formerly code-named Sapphire Rapids) and consist of:
    • Instruction Set Architecture (ISA) support for Intel® Advanced Matrix Extensions (Intel® AMX), Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with FP16, Intel® Advanced Vector Extensions (Intel® AVX) with Vector Neural Network Instructions (VNNI), User Interrupts (UINTR), and more
    • The new type _Float16, extended to all modern x86 targets
    • Enhanced tile automatic configuration and register allocation for Intel AMX intrinsics, stabilizing the programming model
    • Introduction of a new attribute, general-regs-only, with UINTR to improve performance in interrupt handling
    • Improved dot-product performance for byte and short vectors
  • Applications that take advantage of the extended new type, interfaces and intrinsics, and enhanced vectorization can realize performance gains for workloads in 5G, deep learning, system software, and many more.

Intel® AVX-512 with an FP16 Instruction Set

Intel AVX-512 with FP16 is a comprehensive floating-point instruction set extension for the FP16 data type, comparable to the existing FP32 and FP64 support. It supports the complete set of arithmetic operations on the IEEE 754 binary16 floating-point type.

Benefit: Compared to the FP32 and FP64 floating-point formats, FP16 increases execution throughput and reduces storage by narrowing the data range and precision. The programmer needs to decide whether the FP16 type is suitable for their application.

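For instance, the binary16 format keeps only 11 bits of significand precision, so integers above 2048 are no longer exactly representable. The following minimal sketch (any recent clang targeting SSE2 or later can build it) illustrates the effect:

#include <stdio.h>

int main(void) {
  _Float16 a = 2048.0f;          // 2048 = 2^11 is exactly representable
  _Float16 b = a + (_Float16)1;  // the exact sum 2049 is not; it rounds back to 2048
  printf("%f\n", (double)b);     // prints 2048.000000
  return 0;
}
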
Use Cases:
Users can use the new _Float16 type like other floating-point types such as float and double when the option -march=sapphirerapids is specified. Users can take advantage of vector instructions through either compiler auto-vectorization or hundreds of newly added intrinsics.

An Example of a Scalar Arithmetic Operation

_Float16 foo(_Float16 a, _Float16 b) {
  return a + b;
}

Compiled with the command clang -S -march=sapphirerapids -O2

foo:                                    # @foo
        vaddsh  xmm0, xmm0, xmm1
        ret

The example demonstrates how _Float16 can be used like a traditional float or double. AI workloads can benefit from the usability of the new type.

An Example of Compiler Auto-Vectorization

void foo(_Float16 *a, _Float16 *b, _Float16 *restrict c) {
  for (int i = 0; i < 32; ++i)
    c[i] = a[i] + b[i];
}

Compiled with the command clang -S -march=sapphirerapids -O2 -mprefer-vector-width=512

foo:                                    # @foo
        vmovups zmm0, zmmword ptr [rdi]
        vaddph  zmm0, zmm0, zmmword ptr [rsi]
        vmovups zmmword ptr [rdx], zmm0
        vzeroupper
        ret

Auto-vectorization supports the _Float16 type too. Because _Float16 is half the size of float, vector instructions of the same width process twice as many elements, improving throughput.

An Example of Using New Intrinsics

#include <immintrin.h>

__m512h foo(__m512h a, __m512h b) {
  return _mm512_add_round_ph(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}

Compiled with the command clang -S -march=sapphirerapids -O2

foo:                                    # @foo
        vaddph  zmm0, zmm0, zmm1, {rz-sae}
        ret

The newly added intrinsics are much like the existing ones in both naming convention and arguments. Three new vector types __m128h, __m256h, and __m512h were introduced for these new intrinsics. To learn more about intrinsics, see the Intel® Intrinsics Guide.

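The 128-bit and 256-bit variants follow the same pattern. For example, assuming the same -march=sapphirerapids option (which also enables the 256-bit forms), a 256-bit FP16 add might look like this:

#include <immintrin.h>

__m256h add_ph(__m256h a, __m256h b) {
  return _mm256_add_ph(a, b);   // compiles to a single vaddph ymm instruction
}
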
More information about the type and ISA usage can be found in the Technology Guide.

_Float16 Support for Targets without the Intel® AVX-512 with FP16 Feature

We expanded LLVM compiler support for the _Float16 type to all modern x86 targets that support Intel® Streaming SIMD Extensions 2 (Intel® SSE2), through software emulation.

Benefit: Users can now develop and run applications that use the _Float16 type across a variety of Intel architecture systems, even systems that lack the Intel AVX-512 with FP16 feature. They will also gain application performance when deploying to a 4th gen Intel Xeon Scalable processor simply by recompiling.

Use Case: On systems without the Intel AVX-512 with FP16 feature, the compiler relies on libgcc (version 12 and above) or compiler-rt (version 14 and above) for type conversion between _Float16 and float. Alternatively, users may use their own libraries. The linker reports a failure if none of these libraries is available.
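
If the system’s libgcc is too old, compiler-rt can be selected explicitly with the clang option -rtlib=compiler-rt, in which case the conversion routines __extendhfsf2 and __truncsfhf2 come from the compiler-rt builtins library.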

To accelerate the emulation on previous-generation target systems, we provide vectorization support for Intel-based systems that support F16C. Note that the Intel AVX-512 with FP16 intrinsics are not supported on these previous-generation platforms.

An Example of a Scalar Arithmetic Operation

_Float16 foo(_Float16 a, _Float16 b) {
  return a + b;
}

Compiled with the command clang -S -msse2 -O2

foo:                                    # @foo
        push    rax
        movss   dword ptr [rsp + 4], xmm0       # 4-byte Spill
        movaps  xmm0, xmm1
        call    __extendhfsf2@PLT
        movss   dword ptr [rsp], xmm0           # 4-byte Spill
        movss   xmm0, dword ptr [rsp + 4]       # 4-byte Reload
        call    __extendhfsf2@PLT
        addss   xmm0, dword ptr [rsp]           # 4-byte Folded Reload
        call    __truncsfhf2@PLT
        pop     rax
        ret

With the -msse2 option, the compiler generates a call to the library routine __extendhfsf2 to extend the _Float16 type to the float type, emulates the addition in float, and then truncates the result back to _Float16 through a call to __truncsfhf2.

An Example of Compiler Auto-Vectorization

void foo(_Float16 *a, _Float16 *b, _Float16 *restrict c) {
  for (int i = 0; i < 8; ++i)
    c[i] = a[i] + b[i];
}

Compiled with the command clang -S -mf16c -O2

foo:                                    # @foo
        vcvtph2ps       ymm0, xmmword ptr [rsi]
        vcvtph2ps       ymm1, xmmword ptr [rdi]
        vaddps  ymm0, ymm1, ymm0
        vcvtps2ph       xmmword ptr [rdx], ymm0, 4
        vzeroupper
        ret

With the -mf16c option, the vectorizer is able to take advantage of F16C instructions for the type conversions and generates better-performing assembly.

Intel® Advanced Matrix Extensions

Intel AMX is a new 64-bit programming paradigm that consists of two components:

  1. A set of two-dimensional registers (tiles) that represent subarrays from a larger two-dimensional memory image
  2. An accelerator able to operate on tiles

The first implementation is called the tile matrix multiply unit (TMUL). The details of the Intel AMX ISA can be found in ISA Extensions. Intel AMX helps accelerate matrix multiply computations, which are widely used in deep learning workloads. Using these new Intel AMX instructions can provide additional performance gains.

In LLVM v13, we added support for the Intel AMX programming model, which enables developers to accelerate matrix multiply operations.

In LLVM v14, we enhanced the back end to better support the Intel AMX programming model in C/C++ and SYCL*. This enabled SYCL and multilevel intermediate representation (MLIR), which has been adopted by popular deep learning frameworks (such as TensorFlow* and PyTorch*), to extend their languages on top of this infrastructure. (A more detailed look at the Intel AMX support in SYCL and its use can be found in LLVM GitHub* from Intel and an IEEE article that goes into implementation details.)

LLVM v15 enhanced the tile autoconfiguration and register allocation, and stabilized the programming model.

Benefit: By using the new Intel AMX programming paradigm, the performance of matrix math operations is greatly accelerated on the CPU for applications such as AI and machine learning.

Use Case: Matrix math is at the core of HPC and AI and machine learning applications. The extension is designed for operating on matrices, with the goal of accelerating the most prominent CPU use case in AI and machine learning, inference, while also providing more capability for training.

The following code is an example of Intel AMX use. It produces a dot product of a row of tiles from matrix A and a column of tiles from matrix B, accumulating the result into a tile of matrix C.

TC = {TA1, TA2, TA3} x transpose{TB1, TB2, TB3}

#include <immintrin.h>

void amx_dp(char *bufa, char *bufb, int *bufc, int tile_nr) {
    // Each tile is 16 rows x 64 bytes: int8 data for A and B, int32 for C.
    __tile1024i a = {16, 64};
    __tile1024i b = {16, 64};
    __tile1024i c = {16, 64};

    __tile_zero(&c);
    #pragma nounroll
    for (int i = 0; i < tile_nr; i++) {
        // Load the i-th tile of A's row; the stride is the full row width.
        __tile_loadd(&a, bufa + 64*i, 64*tile_nr);
        // Load the i-th tile of B's column; tiles are stored contiguously.
        __tile_loadd(&b, bufb + 1024*i, 64);
        // c += a x b with signed int8 inputs and int32 accumulation.
        __tile_dpbssd(&c, a, b);
    }
    __tile_stored(bufc, 64, c);
}

With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.

amx_dp:                                 # @amx_dp
        push    rbx
        vxorps  zmm0, zmm0, zmm0
        vmovups zmmword ptr [rsp - 64], zmm0
        mov     byte ptr [rsp - 64], 1
        mov     byte ptr [rsp - 16], 16
        mov     word ptr [rsp - 48], 64
        mov     byte ptr [rsp - 15], 16
        mov     word ptr [rsp - 46], 64
        mov     byte ptr [rsp - 14], 16
        mov     word ptr [rsp - 44], 64
        ldtilecfg       [rsp - 64]
        mov     r8w, 64
        mov     ax, 16
        tilezero        tmm0
        test    ecx, ecx
        jle     .LBB0_3
        mov     r9d, ecx
        shl     ecx, 6
        movsxd  r10, ecx
        xor     ecx, ecx
        mov     r11d, 64
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        mov     ebx, ecx
        shl     ebx, 6
        add     rbx, rdi
        tileloadd       tmm1, [rbx + r10]
        mov     ebx, ecx
        shl     ebx, 10
        add     rbx, rsi
        tileloadd       tmm2, [rbx + r11]
        tdpbssd tmm0, tmm1, tmm2
        inc     rcx
        cmp     rcx, r9
        jne     .LBB0_2
.LBB0_3:
        mov     ecx, 64
        tilestored      [rdx + rcx], tmm0
        pop     rbx
        tilerelease
        vzeroupper
        ret

The compiler automatically configures the Intel AMX physical registers, and the ldtilecfg instruction is hoisted out of the loop so that the tile configuration overhead is reduced. At the end of the function, the compiler generates the tilerelease instruction to release the Intel AMX state, which reduces thread context-switch overhead.

The Intel AMX feature was first supported in Linux* kernel v5.16-rc1. On Linux, we need to invoke a syscall to request Intel AMX permission from the kernel. We also need to enlarge the signal stack, since an extra 8 KB is required in the signal context to save the Intel AMX registers. For details, see Intel AMX in Linux.

The following example shows a common way to initialize Intel AMX in code.

#include <err.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/signal.h>

#define fatal_error(msg, ...)   err(1, "[FAIL]\t" msg, ##__VA_ARGS__)

#ifndef AT_MINSIGSTKSZ
#  define AT_MINSIGSTKSZ    51
#endif

#define XFEATURE_XTILECFG   17
#define XFEATURE_XTILEDATA  18
#define XFEATURE_MASK_XTILECFG  (1 << XFEATURE_XTILECFG)
#define XFEATURE_MASK_XTILEDATA (1 << XFEATURE_XTILEDATA)
#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)

#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023

static void request_perm_xtile_data() {
  unsigned long bitmask;
  long rc;

  rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
  if (rc)
    fatal_error("XTILE_DATA request failed: %ld", rc);

  rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask);
  if (rc)
    fatal_error("prctl(ARCH_GET_XCOMP_PERM) error: %ld", rc);

  if (bitmask & XFEATURE_MASK_XTILE)
    printf("ARCH_REQ_XCOMP_PERM XTILE_DATA successful.\n");
}

static void setup_sigaltstack() {
  unsigned long minsigstksz, new_size;
  void *altstack;
  stack_t ss;
  int rc;

  minsigstksz = getauxval(AT_MINSIGSTKSZ);
  printf("AT_MINSIGSTKSZ = %lu\n", minsigstksz);
  /*
   * getauxval() itself can return 0 for failure or
   * success.  But, in this case, AT_MINSIGSTKSZ
   * will always return a >=0 value if implemented.
   * Just check for 0.
   */
  if (minsigstksz == 0)
    fatal_error("no support for AT_MINSIGSTKSZ");

  new_size = minsigstksz * 2;
  altstack = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (altstack == MAP_FAILED)
    fatal_error("mmap() for altstack");

  memset(&ss, 0, sizeof(ss));
  ss.ss_size = new_size;
  ss.ss_sp = altstack;

  rc = sigaltstack(&ss, NULL);
  if (rc)
    fatal_error("sigaltstack failed: %d", rc);
}

void initialize_amx() {
  setup_sigaltstack();
  request_perm_xtile_data();
}

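Putting the pieces together, a minimal and purely illustrative driver might request Intel AMX permission once at startup and then call the amx_dp kernel shown earlier; the buffer sizes below follow the 16x64-byte tile shapes used above:

#include <stdlib.h>

void initialize_amx(void);                                    // defined above
void amx_dp(char *bufa, char *bufb, int *bufc, int tile_nr);  // defined earlier

int main(void) {
  enum { TILE_NR = 4 };
  // A: 16 rows x (64 * TILE_NR) bytes; B: TILE_NR contiguous 1024-byte tiles;
  // C: one 16x16 tile of 32-bit accumulators.
  char *bufa = calloc(16 * 64 * TILE_NR, 1);
  char *bufb = calloc(16 * 64 * TILE_NR, 1);
  int *bufc = calloc(16 * 16, sizeof(int));
  if (!bufa || !bufb || !bufc)
    return 1;

  initialize_amx();                  // request XTILE_DATA and grow the signal stack
  amx_dp(bufa, bufb, bufc, TILE_NR); // C = row(A) x col(B)

  free(bufa);
  free(bufb);
  free(bufc);
  return 0;
}
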
Intel® AVX and Intel AVX-512 with VNNI

Intel AVX with VNNI instructions were added in this generation as a complement to the previous Intel AVX-512 with VNNI versions to accelerate convolutional neural network-based algorithms. We also enhanced the LLVM compiler to automatically generate Intel AVX and Intel AVX-512 with VNNI instructions.

Benefit: By taking advantage of Intel AVX and Intel AVX-512 with VNNI instructions, dot-product performance for char and short vectors is greatly improved.

Use Case: Users need to pay attention to the signedness of the types in the dot product. Due to the native instruction support, the dot product of a signed char and an unsigned char achieves the best performance. For dot products where both operands are signed char or both are unsigned char, extra sign or zero extensions to short are generated before the dot product. Similarly, only signed short dot products can be accelerated on 4th gen Intel Xeon Scalable processors.

In the following example, the compiler automatically generates an Intel AVX with VNNI instruction (vpdpbusd) from a multiply and sum reduction, with the command clang -S -march=sapphirerapids -O3.

int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c) {
    int i;
    for (i = 0; i < 32; i++) {
        c += a[i] * b[i];
    }
    return c;
}

usdot_prod_qi:                          # @usdot_prod_qi
        vmovdqu ymm0, ymmword ptr [rdi]
        vpxor   xmm1, xmm1, xmm1
        {vex}   vpdpbusd      ymm1, ymm0, ymmword ptr [rsi]
        vextracti128    xmm0, ymm1, 1
        vpaddd  xmm0, xmm1, xmm0
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        add     eax, edx
        vzeroupper
        ret

The example shows a dot product between a 32-element unsigned char vector and a 32-element signed char vector. The performance is significantly improved compared to performing 32 scalar multiplies and accumulations in a loop.
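
Similarly, a dot product of two signed short vectors can map to the vpdpwssd instruction. A minimal sketch, compiled with the same clang -S -march=sapphirerapids -O3 command (the exact assembly depends on the compiler version):

int sdot_prod_hi(short *restrict a, short *restrict b, int c) {
    for (int i = 0; i < 16; i++)
        c += a[i] * b[i];
    return c;
}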

User Interrupts (UINTR)

UINTR provides a low-latency event delivery and interprocess communication (IPC) mechanism. These events can be delivered directly to user space without a transition to the kernel.

Benefit: System software developers can benefit from the efficiency of this interprocess communication to improve the performance of their workloads. For more information, see the benchmarks for IPC and UINTR.

Use Case: Starting with LLVM v14, the compiler supports UINTR assembly, intrinsics, and the compiler flag -muintr. The -march=sapphirerapids option also enables the UINTR feature. The (interrupt) attribute can be used to compile a function as a user-interrupt handler. In conjunction with the -muintr flag, the compiler:

  • Generates the entry and exit sequences for the UINTR handler
  • Handles the saving and restoring of registers
  • Calls uiret to return from a user-interrupt handler

UINTR-related compiler intrinsic instructions are declared in <x86gprintrin.h>:

  • _clui() - Clears the user interrupt flag (UIF).
  • _stui() - Sets the UIF.
  • unsigned char _testui() - Stores the current UIF in an unsigned 8-bit integer dst.
  • _senduipi(unsigned __int64 __a) - Sends the user interprocessor interrupt specified in the unsigned 64-bit integer __a.

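These intrinsics can be combined, for example, to mask user-interrupt delivery around a short critical section. The following is a hypothetical sketch that uses only the documented intrinsics:

#include <x86gprintrin.h>

// Hypothetical helpers: save the current UIF, block delivery, and restore.
static inline unsigned char uif_save_and_disable(void) {
  unsigned char prev = _testui();  // read the current UIF
  _clui();                         // clear UIF to block user-interrupt delivery
  return prev;
}

static inline void uif_restore(unsigned char prev) {
  if (prev)
    _stui();  // set UIF again only if delivery was enabled before
}
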
The following is example code for a UINTR handler.

#include <unistd.h>
#include <x86gprintrin.h>

unsigned int uintr_received;
void
__attribute__((interrupt))
__attribute__((target("general-regs-only")))
uintr_handler(struct __uintr_frame *ui_frame,
              unsigned long long vector) {
  static const char print[] = "Received User Interrupt handler\n";

  write(STDOUT_FILENO, print, sizeof(print) - 1);
  uintr_received = 1;
}

With the command clang -S -march=sapphirerapids -O2, it is compiled to the following assembly.

uintr_handler(__uintr_frame*, unsigned long long):
        push    rax
        push    r11
        push    r10
        push    r9
        push    r8
        push    rdi
        push    rsi
        push    rdx
        push    rcx
        push    rax
        push    rax
        cld
        lea     rsi, [rip + uintr_handler(__uintr_frame*, unsigned long long)::print]
        mov     edx, 30
        mov     edi, 1
        call    write@PLT
        mov     dword ptr [rip + uintr_received], 1
        add     rsp, 8
        pop     rax
        pop     rcx
        pop     rdx
        pop     rsi
        pop     rdi
        pop     r8
        pop     r9
        pop     r10
        pop     r11
        add     rsp, 16
        uiret
uintr_received:
        .long   0                               # 0x0

The general-regs-only attribute should be specified because an interrupt handler should not clobber SSE registers. Alternatively, we can specify the option -mgeneral-regs-only when building this file. The uiret instruction is generated by the compiler to return from the interrupt handler.

LLVM for Today’s and Tomorrow’s Development

The latest optimizations in LLVM v15.0.7 have considerably expanded the open source project, including the benefits Intel compilers offer developers. These and future enhancements will continue to streamline and simplify the development and deployment of deep learning applications on current and future Intel architecture. Applications that take advantage of the extended new type, interfaces and intrinsics, and enhanced vectorization can realize performance gains for workloads in 5G, deep learning, system software, and many more. The latest LLVM release (15.0.7) and previous releases can be downloaded from LLVM.

Meanwhile, we offer LLVM-based Intel compilers that are able to achieve better performance on Intel hardware through more advanced optimizations. You can experiment with the latest ISA optimizations on the free Intel® Developer Cloud, which has the latest Intel hardware and software. Additionally, you can download the latest LLVM-based compilers from Intel at Intel® toolkits.

Explore More

Intel® oneAPI Base Toolkit

Develop high-performance, data-centric applications for CPUs, GPUs, and FPGAs with this core set of tools, libraries, and frameworks including LLVM-based compilers.


On-Demand Webinars

Get Started with the Latest LLVM-based Compilers [59:07]


Driver Options, Pragmas & Intrinsics for LLVM-based Compilers from Intel [47:07]
