Visible to Intel only — GUID: GUID-63717C53-5F77-455C-A0B2-BFB8F91473E9
Visible to Intel only — GUID: GUID-63717C53-5F77-455C-A0B2-BFB8F91473E9
Small Register Mode vs. Large Register Mode
There are two general-purpose register (GRF) modes available in Intel® Data Center GPU Max Series – small GRF mode and large GRF mode. There are two ways to control how Intel® Graphics Compiler (IGC) selects between these two modes: (1) command line and (2) per-kernel specification. In this chapter, we provide a step-by-step guideline on how users can provide this control for both SYCL and OpenMP backends.
GRF - An Overview
Intel® Data Center GPU Max Series products support two GRF modes: small GRF mode and large GRF mode. Each Execution Unit (EU) has a total of 64 KB of storage available in registers. In Small GRF mode, a single hardware thread in an EU can access 128 GRF registers, each of which is 64B wide. In this mode, 8 hardware threads are available per EU. In Large GRF mode, a single hardware thread in an EU can access 256 GRF registers, each of which is 64B wide. In this mode, 4 hardware threads are available per EU.
GRF Mode Specification at Command Line
Following is a list of backend options that the users can provide to guide GRF mode selection in the IGC graphics compiler.
-
- -ze-opt-large-register-file:
-
Forces IGC to select large register file mode for ALL kernels
-
- -ze-opt-intel-enable-auto-large-GRF-mode:
-
Enables IGC to select small/large GRF mode on a per-kernel basis based on heuristics
-
- Default:
-
IGC picks small GRF mode for ALL kernels
OpenMP - GRF Mode Selection (AOT)
Following are the various commands that can be used to specify the requisite backend option during AOT compilation for OpenMP backends. Here, test.cpp can be any valid program:
icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device pvc -options -ze-opt-large-register-file" test.cpp // IGC will force large GRF mode for all kernels
icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device pvc -options -ze-intel-enable-auto-large-GRF-mode" test.cpp // IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis
icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device pvc" test.cpp // IGC will automatically use small GRF mode for all kernels
OpenMP – GRF Mode Selection (JIT)
For the OpenMP backend, we use the following environment variable to specify the backend option - LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS. Following are the different options available:
LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-ze-opt-large-register-file" // IGC will force large GRF mode for all kernels LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS ="-ze-intel-enable-auto-large-GRF-mode" // IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis env-var not set // IGC will automatically use small GRF mode for all kernels
SYCL – GRF mode selection (AOT)
Following are the various commands that can be used to specify the requisite backend option during AOT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device pvc -options -ze-opt-large-register-file" test.cpp // IGC will force large GRF mode for all kernels
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device pvc -options -ze-intel-enable-auto-large-GRF-mode" test.cpp // IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device pvc" test.cpp // IGC will automatically use small GRF mode for all kernels
SYCL – GRF mode selection (JIT)
Following are the various commands that can be used to specify the requisite backend option during JIT compilation for SYCL backends. Here, test.cpp can be any valid SYCL program:
icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend "-options -ze-opt-large-register-file" test.cpp // IGC will force large GRF mode for all kernels
icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend "-options -ze-intel-enable-auto-large-GRF-mode" test.cpp // IGC will use compiler heuristics to pick between small and large GRF mode on a per-kernel basis
icpx -fsycl -fsycl-targets=spir64 -Xsycl-target-backend "" test.cpp // IGC will automatically use small GRF mode for all kernels
Register Allocation Mode for SYCL - Per-kernel specification
Let’s start with the following kernel from the SYCL Getting Started guide.
cgh.parallel_for<class FillBuffer>( NumOfWorkItems, [=](sycl::id<1> WIid) { // Fill buffer with indexes Accessor[WIid] = (sycl::cl_int)WIid.get(0); });
Currently, there is no GRF allocation mode specification. To specify large GRF mode, an header file is included as follows:
#include <sycl/ext/intel/experimental/kernel_properties.hpp>
Secondly, the following function call is added inside the function call tree of the kernel:
set_kernel_properties(kernel_properties::use_large_grf);
Here is the modified kernel:
cgh.parallel_for<class FillBuffer>( NumOfWorkItems, [=](sycl::id<1> WIid) { set_kernel_properties(kernel_properties::use_large_grf); // Fill buffer with indexes Accessor[WIid] = (sycl::cl_int)WIid.get(0); });
Performance tuning using GRF mode selection
This section discusses the impact of GRF mode selection on device code performance. The examples shown in this section use the OpenMP offloading model and JIT compilation flow. Two of the main features that govern GRF mode selection are the following: (1) Register pressure for kernel code (2) Number of parallel execution threads. Following is a code snippet containing an OpenMP offload region and this will be used in the forthcoming analysis.
#pragma omp target teams distribute thread_limit(ZDIM *NX1 *NX1) for (int e = 0; e < nelt; e++) { double s_u[NX1 * NX1 * NX1]; double s_D[NX1 * NX1]; // SLM used for the three arrays here double s_ur[NX1 * NX1 * NX1]; double s_us[NX1 * NX1 * NX1]; double s_ut[NX1 * NX1 * NX1]; #pragma omp parallel for for (int inner = 0; inner < innerub; inner++) { int k = inner / (NX1 * NX1); int j = (inner - k * NX1 * NX1) / NX1; int i = inner - k * NX1 * NX1 - j * NX1; if (k == 0) s_D[I2(i, j)] = D[I2(i, j)]; for (; k < NX1; k += ZDIM) { s_u[I3(i, j, k)] = u[I4(i, j, k, e)]; } } #pragma omp parallel for for (int inner = 0; inner < innerub; inner++) { int k = inner / (NX1 * NX1); int j = (inner - k * NX1 * NX1) / NX1; int i = inner - k * NX1 * NX1 - j * NX1; double r_G00, r_G01, r_G02, r_G11, r_G12, r_G22; for (; k < NX1; k += ZDIM) { double r_ur, r_us, r_ut; r_ur = r_us = r_ut = 0; #ifdef FORCE_UNROLL #pragma unroll NX1 #endif for (int m = 0; m < NX1; m++) { r_ur += s_D[I2(i, m)] * s_u[I3(m, j, k)]; r_us += s_D[I2(j, m)] * s_u[I3(i, m, k)]; r_ut += s_D[I2(k, m)] * s_u[I3(i, j, m)]; } const unsigned gbase = 6 * I4(i, j, k, e); r_G00 = g[gbase + 0]; r_G01 = g[gbase + 1]; r_G02 = g[gbase + 2]; s_ur[I3(i, j, k)] = r_G00 * r_ur + r_G01 * r_us + r_G02 * r_ut; r_G11 = g[gbase + 3]; r_G12 = g[gbase + 4]; s_us[I3(i, j, k)] = r_G01 * r_ur + r_G11 * r_us + r_G12 * r_ut; r_G22 = g[gbase + 5]; s_ut[I3(i, j, k)] = r_G02 * r_ur + r_G12 * r_us + r_G22 * r_ut; } } #pragma omp parallel for for (int inner = 0; inner < innerub; inner++) { int k = inner / (NX1 * NX1); int j = (inner - k * NX1 * NX1) / NX1; int i = inner - k * NX1 * NX1 - j * NX1; for (; k < NX1; k += ZDIM) { double wr = 0.0; for (int m = 0; m < NX1; m++) { double s_D_i = s_D[I2(m, i)]; double s_D_j = s_D[I2(m, j)]; double s_D_k = s_D[I2(m, k)]; wr += s_D_i * s_ur[I3(m, j, k)] + s_D_j * s_us[I3(i, m, k)] + s_D_k * s_ut[I3(i, j, m)]; } w[I4(i, j, k, e)] = wr; } } }
There are two parameters that can be modified here: (1) Unroll factor of inner loop in line number 36 (2) Number of OpenMP teams specified in line number 1. The unroll factor can be used to control register pressure. Greater the unroll factor, higher will be the register pressure. Number of OpenMP teams can be used to control the number of parallel threads. In this discussion, kernel execution time on the device is used as metric for performance. Actual numbers are not provided as they may vary based on user environments and device settings. Following are some observations:
When unrolling is turned off, use of small GRF mode is found to provide better performance. This implies that the register pressure is not high enough to get any benefits out of using large GRF mode. When unrolling is turned on, use of large GRF mode is found to provide better performance. This implies that the register pressure is high and large GRF mode is required to accommodate this pressure. Increase in number of teams tends to result in better performance for larger (higher register pressure) workloads when using small GRF mode.