Developer Guide

DPC++ Thread Hierarchy and Mapping

The DPC++ execution model exposes an abstract view of GPUs with Intel® Architecture. The DPC++ thread hierarchy consists of a 1-, 2-, or 3-dimensional grid of work-items. These work-items are grouped into equal-sized thread groups called work-groups. Threads in a work-group are further divided into equal-sized vector groups called sub-groups.
Work-item
A work-item represents one of a collection of parallel executions of a kernel.
Sub-group
A sub-group represents a short range of consecutive work-items that are processed together as a SIMD vector of length 8, 16, 32, or a multiple of the native vector length of a CPU with Intel® UHD Graphics.
Work-group
A work-group is a 1-, 2-, or 3-dimensional set of threads within the thread hierarchy. In DPC++, synchronization across work-items is only possible with barriers for the work-items within the same work-group.

nd_range

An nd_range divides the thread hierarchy into 1-, 2-, or 3-dimensional grids of work-groups. It is represented by a global range and the local range of each work-group.
Thread hierarchy
The diagram above illustrates the relationship among ND-Range, work-group, sub-group, and work-item.
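Each level of this hierarchy is visible inside a kernel through the nd_item object. The following is a minimal sketch, not taken from the guide's samples; the ranges and variable names are illustrative, and h is assumed to be the command-group handler:

h.parallel_for(
    sycl::nd_range<1>(sycl::range<1>(1024),   // global range: 1024 work-items
                      sycl::range<1>(128)),   // local range: 128 work-items per work-group
    [=](sycl::nd_item<1> it) {
      size_t global_id = it.get_global_id(0);  // index of this work-item in the global range
      size_t group_id  = it.get_group(0);      // index of the enclosing work-group
      size_t local_id  = it.get_local_id(0);   // index of this work-item within its work-group
      auto sg = it.get_sub_group();            // handle to the enclosing sub-group
      // (kernel code)
    });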

Thread Synchronization

DPC++ provides two synchronization mechanisms that can be called within a kernel function. Both are defined only for work-items within the same work-group. DPC++ does not provide any synchronization mechanism inside a kernel across all work-items in the entire nd_range.
mem_fence inserts a memory fence on global and local memory access across all work-items in a work-group.
barrier inserts a memory fence and blocks the execution of all work-items within the work-group until all work-items have reached its location.
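For example, a minimal sketch of a work-group barrier inside an nd_range kernel; the ranges are illustrative and h is assumed to be the command-group handler:

h.parallel_for(
    sycl::nd_range<1>(sycl::range<1>(1024), sycl::range<1>(128)),
    [=](sycl::nd_item<1> it) {
      // phase 1: each work-item writes its share of the data
      it.barrier(sycl::access::fence_space::local_space);
      // phase 2: every work-item in the work-group can now safely read
      // what the other work-items of the same work-group wrote in phase 1
    });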

Mapping Work-groups to SubSlices

In the rest of this chapter, we will explain how to pick a proper work-group size to maximize the occupancy of the GPU resources. We will use the Intel® Tiger Lake processors with Intel® Iris® Xe Graphics (TGL) GPU as the execution target. From the Key architecture parameters, Intel UHD Graphics table, we summarize the architecture parameters for the Intel® Iris® Xe Graphics (TGL) GPU below:
Intel® Iris® Xe Graphics (TGL) GPU

                | EUs          | Threads        | Operations      | Maximum Work Group Size | Maximum Work Groups
Each SubSlice   | 16           | 16 x 7 = 112   | 112 x 8 = 896   | 512                     | 16
Total           | 16 x 6 = 96  | 112 x 6 = 672  | 896 x 6 = 5376  | 512                     | 16 x 6 = 64
The maximum work-group size is a constraint imposed by the hardware and the GPU driver. One can query the maximum work-group size supported by the device using device::get_info<cl::sycl::info::device::max_work_group_size>().
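For example, a minimal sketch of such a query using the sycl:: namespace (the queue variable is illustrative):

sycl::queue q;
size_t max_wg_size =
    q.get_device().get_info<sycl::info::device::max_work_group_size>();
std::cout << "Maximum work-group size: " << max_wg_size << "\n";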
Let’s start with a simple kernel:
auto command_group = [&](auto &cgh) {
  cgh.parallel_for(sycl::range<3>(64, 64, 64),  // global range
                   [=](sycl::item<3> it) {
                     // (kernel code)
                   });
};
This kernel contains 262,144 work-items structured as a 3D range of 64 x 64 x 64. It leaves the work-group and sub-group size selection to the compiler. To fully utilize the 5376 parallel operations available in the GPU slice, the compiler must choose a proper work-group size.
The occupancy of an Intel GPU for a given kernel and its work-group parameters can be calculated using the Intel® GPU Occupancy Calculator.
The two most important GPU resources are:
Thread Contexts
The kernel should have a sufficient number of threads to utilize the GPU’s thread contexts.
SIMD Units and SIMD Registers
The kernel should be organized to vectorize the work-items and utilize the SIMD registers.
In a DPC++ kernel, the programmer can affect the work distribution by structuring the kernel with proper work-group size, sub-group size, and organizing the work-items for efficient vector execution. Writing efficient vector kernels is covered in a separate section. In this chapter, we will only focus on work-group and sub-group size selection.
Thread contexts are easier to utilize than SIMD vectors. Therefore, we start with selecting the number of threads in a work-group. Each SubSlice has 112 thread contexts, but we usually cannot use all of the threads in a single work-group if the kernel is also vectorized by 8 (112 x 8 = 896 > 512). From this, we can derive that the maximum number of threads in a work-group is 64 (512 / 8). Because each SubSlice can execute a maximum of 16 work-groups, we need at least 8 threads in a work-group (8 x 16 = 128 > 112) to fully utilize a SubSlice. Hence, the minimum number of threads in a work-group should be 8.
DPC++ does not provide a mechanism to directly set the number of threads in a work-group. However, we can use the work-group size and the SIMD sub-group size to set the number of threads:
Work-group size = Threads x SIMD sub-group size
We can increase the sub-group size as long as there are a sufficient number of registers for the kernel after widening. Note that each EU has 128 SIMD8 registers, so there is ample room for widening simple kernels. The effect of increasing the sub-group size is similar to loop unrolling: while each EU still executes eight 32-bit operations per cycle, the amount of work per sub-group iteration is doubled or quadrupled. In DPC++, a programmer can explicitly specify the sub-group size using intel::reqd_sub_group_size({8|16|32}) to override the compiler's selection.
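For example, a minimal sketch that requests a sub-group size of 16 for an nd_range kernel (the ranges are illustrative and h is assumed to be the command-group handler):

h.parallel_for(
    sycl::nd_range<1>(sycl::range<1>(1024), sycl::range<1>(256)),
    [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
      // 256 work-items per work-group / 16 lanes = 16 threads per work-group
      // (kernel code)
    });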
The table below summarizes the selection criteria for the number of threads and the sub-group size that keep all GPU resources occupied on TGL:
Configurations to ensure full occupancy

Minimum Threads | Maximum Threads | Minimum Sub-group Size | Maximum Sub-group Size | Maximum Work-group Size | Constraint
8               | 64              | 8                      | 32                     | 512                     | Threads x Sub-group Size <= 512
Back to our example program: if a programmer chooses a work-group size less than 64 for sub-group size 8, less than 128 for sub-group size 16, or less than 256 for sub-group size 32, then the kernel cannot fully utilize the TGL GPU's thread contexts. Choosing a larger work-group size has the additional advantage of reducing the number of rounds of work-group dispatching.
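For example, a minimal sketch of the earlier 64 x 64 x 64 kernel with an explicit local range of (1, 8, 64), i.e., 512 work-items per work-group, and a required sub-group size of 8, which yields 512 / 8 = 64 threads per work-group (the maximum from the table above). The specific local range shown is only one of several valid choices:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(64, 64, 64),   // global range
                      sycl::range<3>(1, 8, 64)),    // local range: 512 work-items
    [=](sycl::nd_item<3> it) [[intel::reqd_sub_group_size(8)]] {
      // (kernel code)
    });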
Impact of Work-item Synchronization within Work-group
Let’s look at a kernel requiring work-item synchronization:
auto command_group = [&](auto &cgh) {
  cgh.parallel_for(
      sycl::nd_range(sycl::range(64, 64, 128),  // global range
                     sycl::range(1, R, 128)),   // local range
      [=](sycl::nd_item<3> item) {
        // (kernel code)
        // Internal synchronization
        item.barrier(sycl::access::fence_space::global_space);
        // (kernel code)
      });
};
This kernel is similar to the previous example, except that it requires work-group barrier synchronization. Work-item synchronization is only available to work-items within the same work-group, so a programmer must pick a work-group local range using nd_range and nd_item. Because the synchronization is implemented using a SubSlice's SLM for shared variables, all the work-items of a work-group must be allocated to the same SubSlice, which affects SubSlice occupancy and kernel performance.
In this kernel, the local range of the work-group is given as range(1, R, 128). Assuming the sub-group size is eight, let's look at how the value of the variable R affects EU occupancy. In the case of R=1, the local range is (1, 1, 128) and the work-group size is 128, i.e., 128 / 8 = 16 threads. The work-group occupies only 16 of the 112 thread contexts available on the SubSlice it is allocated to (very low occupancy). However, the system can dispatch 7 work-groups to the same SubSlice to reach full occupancy, at the expense of a higher number of dispatches.
In the case of R>4, the work-group size exceeds the maximum supported work-group size of 512, so the kernel fails to launch. In the case of R=4, a SubSlice is only 57% occupied (4/7), and the remaining 48 thread contexts (3/7) are not sufficient to accommodate another 64-thread work-group, wasting 43% of the available EU capacity. Note that the driver may still be able to dispatch a partial work-group to the unused thread contexts of a SubSlice. However, because of the barrier in the kernel, the partially dispatched work-items cannot pass the barrier until the rest of the work-group is dispatched. In most cases, the kernel's performance does not benefit much from the partial dispatch. Hence, it is important to avoid this problem by properly choosing the work-group size.
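The arithmetic behind these utilization numbers can be reproduced with a few lines of host code. The following standalone sketch hard-codes the TGL parameters from the table above (112 thread contexts per SubSlice, 512 maximum work-group size) and assumes a sub-group size of 8:

#include <cstdio>

int main() {
  const int sub_group_size = 8;      // assumed SIMD sub-group size
  const int subslice_threads = 112;  // thread contexts per SubSlice (TGL)
  const int max_wg_size = 512;       // maximum work-group size

  for (int R = 1; R <= 5; R++) {
    int wg_size = R * 128;  // local range (1, R, 128)
    if (wg_size > max_wg_size) {
      std::printf("R=%d: work-group size %d > %d, kernel fails to launch\n",
                  R, wg_size, max_wg_size);
      continue;
    }
    int wg_threads = wg_size / sub_group_size;   // threads per work-group
    int groups = subslice_threads / wg_threads;  // work-groups per SubSlice
    int occupied = groups * wg_threads;          // occupied thread contexts
    std::printf("R=%d: %d threads/work-group, %d work-group(s) per SubSlice, "
                "occupancy %d/112 = %.1f%%\n",
                R, wg_threads, groups, occupied,
                100.0 * occupied / subslice_threads);
  }
  return 0;
}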
The table below summarizes the tradeoffs among group size, number of threads, SubSlice utilization, and occupancy.
Utilization for various configurations

Work-items              | Group Size  | Threads | SubSlice Utilization | SubSlice Occupancy
64 x 64 x 128 = 524288  | (R=1) 128   | 16      | 16/112 = 14.3%       | 100% with 7 work-groups
64 x 64 x 128 = 524288  | (R=2) 256   | 32      | 32/112 = 28.6%       | 85.7% with 3 work-groups
64 x 64 x 128 = 524288  | (R=3) 384   | 48      | 48/112 = 42.9%       | 85.7% with 2 work-groups
64 x 64 x 128 = 524288  | (R=4) 512   | 64      | 64/112 = 57.1%       | 57.1% maximum
64 x 64 x 128 = 524288  | (R>4) 640+  | Fail to launch
Impact of Local Memory within Work-group
Let’s look at an example where a kernel allocates local memory for a work-group:
auto command_group = [&](auto &cgh) {
  // local memory variables shared among work-items
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      myLocal(sycl::range(R), cgh);
  cgh.parallel_for(
      sycl::nd_range(sycl::range<3>(64, 64, 128),  // global range
                     sycl::range<3>(1, R, 128)),   // local range
      [=](sycl::nd_item<3> myGroup) {
        // (work-group code)
        myLocal[myGroup.get_local_id()[1]] = ...;
      });
};
Because work-group local variables are shared among its work-items, they are allocated in a SubSlice's SLM. Therefore, this work-group must be allocated to a single SubSlice, just as in the intra-group synchronization case. In addition, one must also weigh the sizes of the local variables under different work-group size options so that they fit within a SubSlice's 64KB SLM capacity.
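The available SLM can be queried at runtime rather than hard-coded. A minimal sketch; the queue variable and the reuse of R from the example above are illustrative:

sycl::queue q;
size_t slm_bytes =
    q.get_device().get_info<sycl::info::device::local_mem_size>();
size_t per_wg_bytes = R * sizeof(int);  // size of myLocal in the example above
assert(per_wg_bytes <= slm_bytes && "local variables exceed SLM capacity");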
A Detailed Example
Before we conclude this section, let's look at the hardware occupancies of the variants of a simple vector add example, using Intel® Iris® Xe graphics on the TGL platform as the underlying hardware with the resource parameters specified in the Intel® Iris® Xe Graphics (TGL) GPU table above.
int VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto start = std::chrono::steady_clock::now();
  auto e = q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) {
      for (int j = 0; j < iter; j++)
        sum_acc[i] = a_acc[i] + b_acc[i];
    });
  });
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "VectorAdd1 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd1
The VectorAdd1 above lets the compiler select the work-group size and SIMD width. In this case, the compiler selects a work-group size of 512 and a SIMD width of 32 because the kernel's register pressure is low.
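To see which sub-group sizes the compiler can pick on a given device, one can query the device. A minimal sketch, assuming a SYCL 2020 capable compiler and an illustrative queue variable:

sycl::queue q;
for (auto s :
     q.get_device().get_info<sycl::info::device::sub_group_sizes>())
  std::cout << "Supported sub-group size: " << s << "\n";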
template <int groups>
int VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);
  size_t num_groups = groups;
  // get the max work-group size instead of hard-coding 512
  size_t wg_size = 512;

  auto start = std::chrono::steady_clock::now();
  q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(
        sycl::nd_range<1>(num_groups * wg_size, wg_size),
        [=](sycl::nd_item<1> index) [[intel::reqd_sub_group_size(32)]] {
          size_t grp_id = index.get_group()[0];
          size_t loc_id = index.get_local_id();
          // mysize: number of elements each work-group processes (defined outside this excerpt)
          size_t start = grp_id * mysize;
          size_t end = start + mysize;
          for (int i = 0; i < iter; i++)
            for (size_t i = start + loc_id; i < end; i += wg_size) {
              sum_acc[i] = a_acc[i] + b_acc[i];
            }
        });
  });
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "VectorAdd2<" << groups << "> completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd2
In the VectorAdd2 above, we explicitly specify a work-group size of 512, a SIMD width of 32, and a variable number of work-groups given by the template parameter groups.
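For reference, the rows in the table below correspond to instantiations such as the following (the iteration count shown is illustrative):

VectorAdd2<1>(q, a, b, sum, 1000);
VectorAdd2<8>(q, a, b, sum, 1000);
VectorAdd2<42>(q, a, b, sum, 1000);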
In the absence of intra-work-group synchronization, threads from any work-group can be dispatched to any SubSlice. Dividing the number of threads by the number of available thread contexts in the GPU gives us an estimate of the GPU hardware occupancy. The following table calculates the GPU hardware occupancy using the TGL Intel® Iris® Xe architecture parameters for each of the above two kernels with various arguments.
Occupancy

Program        | Work-groups | Work-items | Work-group Size | SIMD | Threads per Work-group | Threads | Occupancy
VectorAdd1     | 53760       | 13.7M      | 512             | 32   | 16                     | 430K    | 100%
VectorAdd2<1>  | 1           | 512        | 512             | 32   | 16                     | 16      | 16/672 = 2.4%
VectorAdd2<2>  | 2           | 1024       | 512             | 32   | 16                     | 32      | 32/672 = 4.8%
VectorAdd2<3>  | 3           | 1536       | 512             | 32   | 16                     | 48      | 48/672 = 7.1%
VectorAdd2<4>  | 4           | 2048       | 512             | 32   | 16                     | 64      | 64/672 = 9.5%
VectorAdd2<5>  | 5           | 2560       | 512             | 32   | 16                     | 80      | 80/672 = 11.9%
VectorAdd2<6>  | 6           | 3072       | 512             | 32   | 16                     | 96      | 96/672 = 14.3%
VectorAdd2<7>  | 7           | 3584       | 512             | 32   | 16                     | 112     | 112/672 = 16.7%
VectorAdd2<8>  | 8           | 4096       | 512             | 32   | 16                     | 128     | 128/672 = 19%
VectorAdd2<12> | 12          | 6144       | 512             | 32   | 16                     | 192     | 192/672 = 28.6%
VectorAdd2<16> | 16          | 8192       | 512             | 32   | 16                     | 256     | 256/672 = 38.1%
VectorAdd2<20> | 20          | 10240      | 512             | 32   | 16                     | 320     | 320/672 = 47.7%
VectorAdd2<24> | 24          | 12288      | 512             | 32   | 16                     | 384     | 384/672 = 57.1%
VectorAdd2<28> | 28          | 14336      | 512             | 32   | 16                     | 448     | 448/672 = 66.7%
VectorAdd2<32> | 32          | 16384      | 512             | 32   | 16                     | 512     | 512/672 = 76.2%
VectorAdd2<36> | 36          | 18432      | 512             | 32   | 16                     | 576     | 576/672 = 85.7%
VectorAdd2<40> | 40          | 20480      | 512             | 32   | 16                     | 640     | 640/672 = 95.2%
VectorAdd2<42> | 42          | 21504      | 512             | 32   | 16                     | 672     | 672/672 = 100%
VectorAdd2<44> | 44          | 22528      | 512             | 32   | 16                     | 704     | 100% then 4.7%
VectorAdd2<48> | 48          | 24576      | 512             | 32   | 16                     | 768     | 100% then 14.3%
The following VTune analyzer chart for VectorAdd2 with various work-group sizes confirms the accuracy of our estimate. The numbers in the grid view vary slightly from the estimate because the grid view gives an average across the entire execution.
Occupancy for VectorAdd2 as shown by VTune
The following timeline view shows the occupancy over time. The occupancy metric matches our estimate for a large part of the kernel execution and tapers off toward the end because the threads finish their execution at varying times.
VectorAdd2 timeline view
The kernel VectorAdd3 shown below is similar to the kernels above, with two important differences.
  1. It can be instantiated with the number of work-groups, the work-group size, and the sub-group size as template parameters. This allows us to run experiments to investigate the impact of the number of sub-groups and work-groups on thread occupancy.
  2. The amount of work done inside the kernel is dramatically increased to ensure that these kernels are resident in the execution units doing work for a substantial amount of time.
template <int groups, int wg_size, int sg_size>
int VectorAdd3(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);
  size_t num_groups = groups;

  auto start = std::chrono::steady_clock::now();
  q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(
        sycl::nd_range<1>(num_groups * wg_size, wg_size),
        [=](sycl::nd_item<1> index) [[intel::reqd_sub_group_size(sg_size)]] {
          size_t grp_id = index.get_group()[0];
          size_t loc_id = index.get_local_id();
          // mysize: number of elements each work-group processes (defined outside this excerpt)
          size_t start = grp_id * mysize;
          size_t end = start + mysize;
          for (int i = 0; i < iter; i++)
            for (size_t i = start + loc_id; i < end; i += wg_size) {
              sum_acc[i] = a_acc[i] + b_acc[i];
            }
        });
  });
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "VectorAdd3<" << groups << "> completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd3
The kernel VectorAdd4 is similar to the kernel VectorAdd3 above, except that it has a barrier synchronization at the beginning and end of the kernel execution. Although this barrier is not needed functionally, it significantly impacts the way in which threads are scheduled on the hardware.
template <int groups, int wg_size, int sg_size>
int VectorAdd4(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);
  size_t num_groups = groups;

  auto start = std::chrono::steady_clock::now();
  q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(
        sycl::nd_range<1>(num_groups * wg_size, wg_size),
        [=](sycl::nd_item<1> index) [[intel::reqd_sub_group_size(sg_size)]] {
          index.barrier(sycl::access::fence_space::local_space);
          size_t grp_id = index.get_group()[0];
          size_t loc_id = index.get_local_id();
          // mysize: number of elements each work-group processes (defined outside this excerpt)
          size_t start = grp_id * mysize;
          size_t end = start + mysize;
          for (int i = 0; i < iter; i++) {
            for (size_t i = start + loc_id; i < end; i += wg_size) {
              sum_acc[i] = a_acc[i] + b_acc[i];
            }
          }
        });
  });
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "VectorAdd4<" << groups << "> completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd4
To illustrate the manner in which threads are scheduled, the above two kernels are called with 8 work-groups, a sub-group size of 8, and a work-group size of 320, as shown below. Based on this choice of work-group size and sub-group size, there will be 40 threads per work-group that need to be scheduled by the hardware.
Initialize(sum);
VectorAdd3<8, 320, 8>(q, a, b, sum, 10000);

Initialize(sum);
VectorAdd4<8, 320, 8>(q, a, b, sum, 10000);
The chart from Intel® VTune below shows the measured GPU occupancy for the VectorAdd3 and VectorAdd4 kernels.
GPU occupancy VectorAdd3, VectorAdd4 kernels
For the VectorAdd3 kernel, there are two phases of occupancy: 33.3% (224 threads) and 14.3% (96 threads) on a TGL machine, which has a total of 672 threads. Since there are a total of eight work-groups, each with 40 threads, we can conclude that the threads of six work-groups are scheduled onto two SubSlices (each of which has 112 thread contexts): in the first phase, 40 threads each from four work-groups and 32 threads each from two other work-groups are scheduled (4 x 40 + 2 x 32 = 224). Later, in the second phase, the 40 threads from each of the two remaining work-groups, plus the 8 remaining threads from each of the two partially scheduled work-groups, are scheduled for execution (2 x 40 + 2 x 8 = 96).
For the VectorAdd4 kernel, there are three phases of occupancy: 45.3% (304 threads), 39.3% (264 threads), and 11.9% (80 threads). In the first phase, all eight work-groups are scheduled together on three SubSlices, with two SubSlices getting 112 threads each (80 from two work-groups and 32 from one work-group) and one SubSlice getting 80 threads (from two work-groups). In the second phase, one work-group has completed execution, which gives us an occupancy of 304 - 40 = 264 threads. In the last phase, the remaining eight threads of two work-groups are scheduled and complete the execution.
When the same kernels are run with many more work-groups and with a work-group size chosen to be a multiple of the number of threads in a SubSlice, the hardware is well utilized, achieving close to 100% occupancy, as shown below.
Initialize(sum);
VectorAdd3<24, 224, 8>(q, a, b, sum, 10000);

Initialize(sum);
VectorAdd4<24, 224, 8>(q, a, b, sum, 10000);
This kernel execution has a different thread occupancy since we have many more threads and the work-group size is a multiple of the number of threads in a SubSlice; this is shown below in the thread occupancy metric on the VTune timeline.
Thread occupancy metric on VTune
Note that the above schedule is a guess based on the different occupancy numbers, since we do not yet have a way to examine per-SubSlice occupancy numbers.
One can run different experiments with the above kernels to gain a better understanding of how the GPU hardware schedules software threads on the Execution Units. Be careful about the work-group size and sub-group size, in addition to launching a large number of work-groups, to ensure effective utilization of the GPU hardware.
