Small Register Mode vs. Large Register Mode

Developer Guide

oneAPI GPU Optimization Guide

Download PDF

ID 771772

Date 3/31/2023

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Visible to Intel only — GUID: GUID-63717C53-5F77-455C-A0B2-BFB8F91473E9

View Details

Small Register Mode vs. Large Register Mode

There are two general-purpose register (GRF) modes available in Intel^® Data Center GPU Max Series – small GRF mode and large GRF mode. There are two ways to control how Intel^® Graphics Compiler (IGC) selects between these two modes: (1) command line and (2) per-kernel specification. In this chapter, we provide a step-by-step guideline on how users can provide this control for both SYCL and OpenMP backends.

GRF - An Overview

Intel^® Data Center GPU Max Series products support two GRF modes: small GRF mode and large GRF mode. Each Execution Unit (EU) has a total of 64 KB of storage available in registers. In Small GRF mode, a single hardware thread in an EU can access 128 GRF registers, each of which is 64B wide. In this mode, 8 hardware threads are available per EU. In Large GRF mode, a single hardware thread in an EU can access 256 GRF registers, each of which is 64B wide. In this mode, 4 hardware threads are available per EU.

GRF Mode Specification at Command Line

Following is a list of backend options that the users can provide to guide GRF mode selection in the IGC graphics compiler.

-ze-opt-large-register-file:

Forces IGC to select large register file mode for ALL kernels
-ze-opt-intel-enable-auto-large-GRF-mode:

Enables IGC to select small/large GRF mode on a per-kernel basis based on heuristics
Default:

IGC picks small GRF mode for ALL kernels

OpenMP - GRF Mode Selection (AOT)

Following are the various commands that can be used to specify the requisite backend option during AOT compilation for OpenMP backends. Here, test.cpp can be any valid program:

icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend
"-device pvc -options -ze-opt-large-register-file" test.cpp
// IGC will force large GRF mode for all kernels

icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend
"-device pvc -options -ze-intel-enable-auto-large-GRF-mode" test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device pvc" test.cpp
// IGC will automatically use small GRF mode for all kernels

OpenMP – GRF Mode Selection (JIT)

For the OpenMP backend, we use the following environment variable to specify the backend option - LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS. Following are the different options available:

LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-ze-opt-large-register-file"
// IGC will force large GRF mode for all kernels
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS
="-ze-intel-enable-auto-large-GRF-mode"
// IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis
env-var not set
// IGC will automatically use small GRF mode for all kernels

SYCL – GRF mode selection (AOT)

Following are the various commands that can be used to specify the requisite backend option during AOT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:

icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend
"-device pvc -options -ze-opt-large-register-file" test.cpp
// IGC will force large GRF mode for all kernels

icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend
"-device pvc -options -ze-intel-enable-auto-large-GRF-mode" test.cpp
// IGC will use compiler heuristics to pick between small and large GRF
mode on a per-kernel basis

icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device pvc" test.cpp
// IGC will automatically use small GRF mode for all kernels

SYCL – GRF mode selection (JIT)

Following are the various commands that can be used to specify the requisite backend option during JIT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:

icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend
"-options -ze-opt-large-register-file" test.cpp
// IGC will force large GRF mode for all kernels

icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend "-options -ze-intel-enable-auto-large-GRF-mode" test.cpp
// IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis

icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend "" test.cpp
// IGC will automatically use small GRF mode for all kernels

Register Allocation Mode for SYCL - Per-kernel specification

Let’s start with the following kernel from the SYCL Getting Started guide.

cgh.parallel_for<class FillBuffer>(
   NumOfWorkItems, [=](sycl::id<1> WIid) {
     // Fill buffer with indexes
     Accessor[WIid] = (sycl::cl_int)WIid.get(0);
});

Currently, there is no GRF allocation mode specification. To specify large GRF mode, an header file is included as follows:

#include <sycl/ext/intel/experimental/kernel_properties.hpp>

Secondly, the following function call is added inside the function call tree of the kernel:

set_kernel_properties(kernel_properties::use_large_grf);

Here is the modified kernel:

cgh.parallel_for<class FillBuffer>(
   NumOfWorkItems, [=](sycl::id<1> WIid) {
     set_kernel_properties(kernel_properties::use_large_grf);
     // Fill buffer with indexes
     Accessor[WIid] = (sycl::cl_int)WIid.get(0);
});

Performance tuning using GRF mode selection

This section discusses the impact of GRF mode selection on device code performance. The examples shown in this section use the OpenMP offloading model and JIT compilation flow. Two of the main features that govern GRF mode selection are the following: (1) Register pressure for kernel code (2) Number of parallel execution threads. Following is a code snippet containing an OpenMP offload region and this will be used in the forthcoming analysis.

#pragma omp target teams distribute thread_limit(ZDIM *NX1 *NX1)
  for (int e = 0; e < nelt; e++) {
    double s_u[NX1 * NX1 * NX1];
    double s_D[NX1 * NX1];
    // SLM used for the three arrays here
    double s_ur[NX1 * NX1 * NX1];
    double s_us[NX1 * NX1 * NX1];
    double s_ut[NX1 * NX1 * NX1];

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;
      if (k == 0)
        s_D[I2(i, j)] = D[I2(i, j)];
      for (; k < NX1; k += ZDIM) {
        s_u[I3(i, j, k)] = u[I4(i, j, k, e)];
      }
    }

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;

      double r_G00, r_G01, r_G02, r_G11, r_G12, r_G22;

      for (; k < NX1; k += ZDIM) {
        double r_ur, r_us, r_ut;
        r_ur = r_us = r_ut = 0;
#ifdef FORCE_UNROLL
#pragma unroll NX1
#endif
        for (int m = 0; m < NX1; m++) {
          r_ur += s_D[I2(i, m)] * s_u[I3(m, j, k)];
          r_us += s_D[I2(j, m)] * s_u[I3(i, m, k)];
          r_ut += s_D[I2(k, m)] * s_u[I3(i, j, m)];
        }

        const unsigned gbase = 6 * I4(i, j, k, e);
        r_G00 = g[gbase + 0];
        r_G01 = g[gbase + 1];
        r_G02 = g[gbase + 2];
        s_ur[I3(i, j, k)] = r_G00 * r_ur + r_G01 * r_us + r_G02 * r_ut;
        r_G11 = g[gbase + 3];
        r_G12 = g[gbase + 4];
        s_us[I3(i, j, k)] = r_G01 * r_ur + r_G11 * r_us + r_G12 * r_ut;
        r_G22 = g[gbase + 5];
        s_ut[I3(i, j, k)] = r_G02 * r_ur + r_G12 * r_us + r_G22 * r_ut;
      }
    }

#pragma omp parallel for
    for (int inner = 0; inner < innerub; inner++) {
      int k = inner / (NX1 * NX1);
      int j = (inner - k * NX1 * NX1) / NX1;
      int i = inner - k * NX1 * NX1 - j * NX1;
      for (; k < NX1; k += ZDIM) {
        double wr = 0.0;
        for (int m = 0; m < NX1; m++) {
          double s_D_i = s_D[I2(m, i)];
          double s_D_j = s_D[I2(m, j)];
          double s_D_k = s_D[I2(m, k)];
          wr += s_D_i * s_ur[I3(m, j, k)] + s_D_j * s_us[I3(i, m, k)] +
                s_D_k * s_ut[I3(i, j, m)];
        }
        w[I4(i, j, k, e)] = wr;
      }
    }
  }

There are two parameters that can be modified here: (1) Unroll factor of inner loop in line number 36 (2) Number of OpenMP teams specified in line number 1. The unroll factor can be used to control register pressure. Greater the unroll factor, higher will be the register pressure. Number of OpenMP teams can be used to control the number of parallel threads. In this discussion, kernel execution time on the device is used as metric for performance. Actual numbers are not provided as they may vary based on user environments and device settings. Following are some observations:

When unrolling is turned off, use of small GRF mode is found to provide
better performance. This implies that the register pressure is not high
enough to get any benefits out of using large GRF mode.

When unrolling is turned on, use of large GRF mode is found to provide
better performance. This implies that the register pressure is high and
large GRF mode is required to accommodate this pressure.

Increase in number of teams tends to result in better performance for
larger (higher register pressure) workloads when using small GRF mode.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

oneAPI GPU Optimization Guide

Small Register Mode vs. Large Register Mode

GRF - An Overview

GRF Mode Specification at Command Line

Register Allocation Mode for SYCL - Per-kernel specification

Performance tuning using GRF mode selection