Small Register Mode vs. Large Register Mode

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 7/18/2024

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Small Register Mode vs. Large Register Mode

Intel^® Data Center GPU Max Series products support two GRF modes: small GRF mode and large GRF mode. Each XVE has a total of 64 KB of register space In Small GRF mode, a single hardware thread can access 128 GRF registers, each of which is 64B wide. In this mode, 8 hardware threads are available per XVE. In Large GRF mode, a single hardware thread can access 256 GRF registers, each of which is 64B wide. In this mode, 4 hardware threads are available per XVE.

There are two ways to control how Intel^® Graphics Compiler (IGC) selects between these two modes: (1) command line and (2) per-kernel specification. In this chapter, we provide a step-by-step guideline on how users can provide this control for both SYCL and OpenMP applications.

GRF Mode Specification at Command Line

The -ftarget-register-alloc-mode=<arg> compiler option provides the ability to guide GRF mode selection in the IGC graphics compiler. The format of <arg> is Device0:Mode0[,Device1:Mode1...]. Currently the only supported Device is pvc. The supported modes are:

default

Provide no specification to IGC on the register file mode to select. Currently, IGC always chooses small register file mode with no specification.
small

Forces IGC to select small register file mode for ALL kernels
large

Forces IGC to select large register file mode for ALL kernels
auto

Enables IGC to select small/large GRF mode on a per-kernel basis based on heuristics

If this option is not specified, IGC selects a GRF mode on a per-kernel basis based on heuristics on Linux for the Intel^® Data Center GPU Max Series and small GRF mode otherwise.

OpenMP - GRF Mode Selection (AOT)

Following are the various commands that can be used to specify the requisite backend option during AOT compilation for OpenMP backends. Here, test.cpp can be any valid program:

icpx -fiopenmp -fopenmp-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:large
-Xopenmp-target-backend "-device pvc" test.cpp
// IGC will force large GRF mode for all kernels

icpx -fiopenmp -fopenmp-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:auto
 -Xopenmp-target-backend "-device pvc" test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fiopenmp -fopenmp-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:small
-Xopenmp-target-backend "-device pvc" test.cpp
// IGC will automatically use small GRF mode for all kernels

OpenMP - GRF Mode Selection (JIT)

Following are the various commands that can be used to specify the requisite backend option during JIT compilation for OpenMP backends. Here, test.cpp can be any valid program:

icpx -fiopenmp -fopenmp-targets=spir64
-ftarget-register-alloc-mode=pvc:large
test.cpp
// IGC will force large GRF mode for all kernels

icpx -fiopenmp -fopenmp-targets=spir64
-ftarget-register-alloc-mode=pvc:auto
test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fiopenmp -fopenmp-targets=spir64
-ftarget-register-alloc-mode=pvc:small
test.cpp
// IGC will automatically use small GRF mode for all kernels

SYCL – GRF Mode Selection (AOT)

Following are the various commands that can be used to specify the requisite backend option during AOT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:

icpx -fsycl -fsycl-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:large
-Xsycl-target-backend "-device pvc" test.cpp
// IGC will force large GRF mode for all kernels

icpx -fsycl -fsycl-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:auto
-Xsycl-target-backend "-device pvc" test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fsycl -fsycl-targets=spir64_gen
-ftarget-register-alloc-mode=pvc:small
-Xsycl-target-backend "-device pvc" test.cpp
// IGC will automatically use small GRF mode for all kernels

SYCL – GRF Mode Selection (JIT)

Following are the various commands that can be used to specify the requisite backend option during JIT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:

icpx -fsycl
-ftarget-register-alloc-mode=pvc:large
test.cpp
// IGC will force large GRF mode for all kernels

icpx -fsycl
-ftarget-register-alloc-mode=pvc:auto
test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fsycl
-ftarget-register-alloc-mode=pvc:small
test.cpp
// IGC will automatically use small GRF mode for all kernels

Performance Tuning Using GRF Mode Selection

This section discusses the impact of GRF mode selection on device code performance. The examples shown in this section use the OpenMP offloading model and JIT compilation flow. Two of the main features that govern GRF mode selection are the following: (1) Register pressure for kernel code (2) Number of parallel execution threads. Following is a code snippet containing an OpenMP offload region and this will be used in the forthcoming analysis.

#pragma omp target teams distribute thread_limit(ZDIM *NX1 *NX1)
  for (int e = 0; e < nelt; e++) {
    double s_u[NX1 * NX1 * NX1];
    double s_D[NX1 * NX1];
    // SLM used for the three arrays here
    double s_ur[NX1 * NX1 * NX1];
    double s_us[NX1 * NX1 * NX1];
    double s_ut[NX1 * NX1 * NX1];

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;
      if (k == 0)
        s_D[I2(i, j)] = D[I2(i, j)];
      for (; k < NX1; k += ZDIM) {
        s_u[I3(i, j, k)] = u[I4(i, j, k, e)];
      }
    }

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;

      double r_G00, r_G01, r_G02, r_G11, r_G12, r_G22;

      for (; k < NX1; k += ZDIM) {
        double r_ur, r_us, r_ut;
        r_ur = r_us = r_ut = 0;
#ifdef FORCE_UNROLL
#pragma unroll NX1
#endif
        for (int m = 0; m < NX1; m++) {
          r_ur += s_D[I2(i, m)] * s_u[I3(m, j, k)];
          r_us += s_D[I2(j, m)] * s_u[I3(i, m, k)];
          r_ut += s_D[I2(k, m)] * s_u[I3(i, j, m)];
        }

        const unsigned gbase = 6 * I4(i, j, k, e);
        r_G00 = g[gbase + 0];
        r_G01 = g[gbase + 1];
        r_G02 = g[gbase + 2];
        s_ur[I3(i, j, k)] = r_G00 * r_ur + r_G01 * r_us + r_G02 * r_ut;
        r_G11 = g[gbase + 3];
        r_G12 = g[gbase + 4];
        s_us[I3(i, j, k)] = r_G01 * r_ur + r_G11 * r_us + r_G12 * r_ut;
        r_G22 = g[gbase + 5];
        s_ut[I3(i, j, k)] = r_G02 * r_ur + r_G12 * r_us + r_G22 * r_ut;
      }
    }

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;
      for (; k < NX1; k += ZDIM) {
        double wr = 0.0;
        for (int m = 0; m < NX1; m++) {
          double s_D_i = s_D[I2(m, i)];
          double s_D_j = s_D[I2(m, j)];
          double s_D_k = s_D[I2(m, k)];
          wr += s_D_i * s_ur[I3(m, j, k)] + s_D_j * s_us[I3(i, m, k)] +
                s_D_k * s_ut[I3(i, j, m)];
        }
        w[I4(i, j, k, e)] = wr;
      }
    }
  }

There are two parameters that can be modified here: (1) Unroll factor of inner loop in line number 36 (2) Number of OpenMP teams specified in line number 1. The unroll factor can be used to control register pressure. Greater the unroll factor, higher will be the register pressure. Number of OpenMP teams can be used to control the number of parallel threads. In this discussion, kernel execution time on the device is used as metric for performance. Actual numbers are not provided as they may vary based on user environments and device settings. Following are some observations:

When unrolling is turned off, use of small GRF mode is found to provide
better performance. This implies that the register pressure is not high
enough to get any benefits out of using large GRF mode.

When unrolling is turned on, use of large GRF mode is found to provide
better performance. This implies that the register pressure is high and
large GRF mode is required to accommodate this pressure.

Increase in number of teams tends to result in better performance for
larger (higher register pressure) workloads when using small GRF mode.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide

Small Register Mode vs. Large Register Mode

GRF Mode Specification at Command Line

Performance Tuning Using GRF Mode Selection