Developer Guide

Using More GPU Resources

We can improve the performance of offloaded code by using a larger number of work-items that can run in parallel, thus utilizing more GPU resources (filling up the GPU).

Collapse Clause

One way to increase parallelism in a loop nest is to use the collapse clause to collapse two or more loops in the loop nest. Collapsing results in a larger number of iterations that can run in parallel, thus using more work-items on the GPU.

In the following example, a loop nest composed of four perfectly nested loops is offloaded onto the GPU. The parallel for directive indicates that the outermost loop (the loop over b) is parallel. The number of iterations in that loop is BLOCKS, which is equal to 8.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[])
{
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  {
    ;
  }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  /* map data to device */
  #pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P])

  start = omp_get_wtime();

  /* offload the kernel with no collapse clause */
  #pragma omp target teams distribute parallel for \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }

  end = omp_get_wtime();

  #pragma omp target exit data map(from: w[0:SIZE])

  /* print result */
  printf("no-collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
libomptarget.so debug information (emitted at runtime when the environment variable LIBOMPTARGET_DEBUG=1 is set) shows the ND-range partitioning of the loop iterations and how parallelism is increased by using the collapse clause. In the output, Lb and Ub refer to the parallel loop lower bound and upper bound, respectively, in each dimension of the partitioning.
Without the collapse clause, LIBOMPTARGET_DEBUG=1 output shows the following information about the offloaded target region (line 45 of the original source file, hence the _l45 suffix in the kernel name).
Libomptarget --> Launching target execution __omp_offloading_802_b85fb2__Z4main_l45 with pointer 0x0000000000ff1b48 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000000ff1b48...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 7, Stride = 1
Target LEVEL0 RTL --> Group sizes = {8, 1, 1}
Target LEVEL0 RTL --> Group counts = {1, 1, 1}
Note that without the collapse clause, the number of parallel loop iterations = 8, since the upper bound of the outermost loop (BLOCKS) = 8. In this case, we end up with one work-group that has 8 work-items (total work-group count = 1 x 1 x 1 = 1, and each work-group size = 8 x 1 x 1 = 8 work-items). The kernel is vectorized using SIMD 32, which means every 32 work-items are combined into one sub-group. Since we have only 8 work-items, it follows that we have only one sub-group where not all SIMD lanes are active.
We can increase parallelism and hence the number of work-items used on the GPU by adding a collapse clause on the parallel for directive. We start by adding the collapse(2) clause, as shown in the following modified example.
  /* offload the kernel with collapse clause */
  #pragma omp target teams distribute parallel for collapse(2) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }
LIBOMPTARGET_DEBUG=1 output shows the following partitioning when collapse(2) is used.
Libomptarget --> Launching target execution __omp_offloading_802_b85fb3__Z4main_l45 with pointer 0x0000000001dffc98 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000001dffc98...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 15, Stride = 1
Target LEVEL0 RTL --> Level 1: Lb = 0, Ub = 7, Stride = 1
Target LEVEL0 RTL --> Group sizes = {16, 1, 1}
Target LEVEL0 RTL --> Group counts = {1, 8, 1}
Note that with collapse(2), the number of parallel loop iterations = BLOCKS x P = 8 x 16 = 128. In this case, we end up with 8 work-groups, and each work-group has 16 work-items (total work-group count = 1 x 8 x 1 = 8, and each work-group size = 16 x 1 x 1 = 16 work-items). The kernel is vectorized using SIMD 16, which means every 16 work-items are combined into one sub-group. It follows that each work-group has one sub-group.
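Collapsing more loops requires only changing the clause argument; the loop body is unchanged. As a sketch, the collapse(3) variant of the combined construct looks like the following fragment of the full program above (the collapse(4) variant is analogous, with the k loop also included in the collapse):

  /* offload the kernel, collapsing the three outer loops (b, i, j) */
  #pragma omp target teams distribute parallel for collapse(3) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          /* ... same loop body as in the versions above ... */
        }
      }
    }
  }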
If we use the collapse(3) clause, LIBOMPTARGET_DEBUG=1 output shows the following partitioning.
Libomptarget --> Launching target execution __omp_offloading_802_b85fb4__Z4main_l45 with pointer 0x0000000000a2b9b8 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000000a2b9b8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 15, Stride = 1
Target LEVEL0 RTL --> Level 1: Lb = 0, Ub = 15, Stride = 1
Target LEVEL0 RTL --> Level 2: Lb = 0, Ub = 7, Stride = 1
Target LEVEL0 RTL --> Group sizes = {16, 1, 1}
Target LEVEL0 RTL --> Group counts = {1, 16, 8}
With collapse(3), the number of resulting parallel loop iterations = BLOCKS x P x P = 8 x 16 x 16 = 2048. In this case, we end up with 128 work-groups, and each work-group has 16 work-items (total work-group count = 1 x 16 x 8 = 128, and each work-group size = 16 x 1 x 1 = 16 work-items). The kernel is vectorized using SIMD 16, which means every 16 work-items are combined into one sub-group. It follows that each work-group has one sub-group.
If we use the collapse(4) clause instead of collapse(3), LIBOMPTARGET_DEBUG=1 output shows the following partitioning.
Libomptarget --> Launching target execution __omp_offloading_802_b85fb5__Z4main_l45 with pointer 0x0000000000aeec98 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000000aeec98...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
With collapse(4), the number of resulting parallel loop iterations = BLOCKS x P x P x P = 8 x 16 x 16 x 16 = 32768. In this case, the compiler and runtime decide on one-dimensional partitioning, where we have 1024 work-groups, and each work-group has 32 work-items (total work-group count = 1024 x 1 x 1 = 1024, and each work-group size = 32 x 1 x 1 = 32 work-items). The kernel is vectorized using SIMD 16, which means every 16 work-items are combined into one sub-group. It follows that each work-group has two sub-groups.
Using the collapse clause significantly reduces the runtime of the loop nest. The performance of the various versions when running on the particular ATS GPU used (1-tile only) was as follows:
no collapse version : 0.028665 seconds
collapse(2) version : 0.003309 seconds
collapse(3) version : 0.002016 seconds
collapse(4) version : 0.002016 seconds
The above timings show that adding the collapse(3) or collapse(4) clause gives a performance boost of about 14x (0.002016 seconds versus 0.028665 seconds).
Note that on the GPU, the collapse clause may not result in any actual loop collapsing at all, but the clause conveys to the compiler and runtime the degree of parallelism in the loop nest and is used to determine the ND-range partitioning.
To take advantage of vector loads and stores, it is recommended that the innermost loop in a loop nest not be included in the collapsing so it can be vectorized. Best performance is achieved when the innermost loop has unit stride and its number of iterations is at least as large as the SIMD width.
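The following standalone sketch (illustrative only; the array name and sizes are placeholders, not taken from the example above) applies this guideline to a simple loop nest: the two outer loops are collapsed to create enough work-items, while the innermost loop, which has unit stride and at least as many iterations as a typical SIMD width, is left out of the collapse so the compiler can vectorize it.

#include <stdio.h>
#include <omp.h>

#define NI 64
#define NJ 64
#define NK 128   /* innermost extent: unit stride, >= SIMD width */

int main(void) {
  static double a[NI][NJ][NK];   /* static to avoid a large stack allocation */

  /* collapse only the outer two loops; keep the unit-stride k loop
     uncollapsed so it remains a candidate for vectorization */
  #pragma omp target teams distribute parallel for collapse(2) \
      map(tofrom: a)
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int k = 0; k < NK; k++)   /* not collapsed */
        a[i][j][k] = 2.0 * a[i][j][k] + 1.0;

  printf("a[0][0][0] = %f\n", a[0][0][0]);
  return 0;
}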

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.