Developer Guide

Minimizing Data Transfers and Memory Allocations

When offloading computations onto the GPU, it is important to minimize data transfers between the host and the device, and reduce memory allocations on the device. There are various ways to achieve this, as described below.

Using target enter data and target exit data Directives

When variables are used by multiple target constructs, the target enter data and target exit data pair of directives can be used to minimize data transfers between the host and the device. Place the target enter data directive before the first target construct to transfer data from host to device, and place the target exit data directive after the last target construct to transfer data from device to host.
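For illustration, here is a minimal sketch of the pattern (not taken from the example below; the array a and its length n are hypothetical):

/* Minimal sketch of the target enter/exit data pattern (hypothetical array a of length n). */
void enter_exit_data_sketch(double *a, int n) {
  #pragma omp target enter data map(to: a[0:n])     /* allocate on device, copy host-to-device once */

  #pragma omp target teams distribute parallel for  /* kernel #1: a is already present, no transfer */
  for (int i = 0; i < n; i++)
    a[i] += 1.0;

  #pragma omp target teams distribute parallel for  /* kernel #2: still no transfer */
  for (int i = 0; i < n; i++)
    a[i] *= 2.0;

  #pragma omp target exit data map(from: a[0:n])    /* copy device-to-host once, release the storage */
}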
Consider the following example, where we have two target constructs (on lines 42 and 66). Both target constructs read arrays dx and u; the first construct writes array w, and the second reads and updates it.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  /* offload kernel #1 */
  #pragma omp target teams distribute parallel for collapse(4) \
      map(to: u[0:SIZE], dx[0:P * P]) map(from: w[0:SIZE]) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }

  /* offload kernel #2 */
  #pragma omp target teams distribute parallel for collapse(4) \
      map(to: u[0:SIZE], dx[0:P * P]) map(tofrom: w[0:SIZE]) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = b + i + j - k;
          double us = b + i + j - k;
          double ut = b + i + j - k;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] += ur * us * ut;
        }
      }
    }
  }

  end = omp_get_wtime();

  /* print result */
  printf("target region: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
When the first target construct (on line 42) is encountered:
  • Since arrays dx and u appear in a map clause with the to map-type, storage is allocated for arrays dx and u on the device, and the values of dx and u on the host are copied to the corresponding arrays on the device.
  • Since array w appears in a map clause with the from map-type, uninitialized storage is allocated for array w on the device.
At the end of the first target region:
  • Since array w appears in a map clause with the from map-type, the values of array w on the device are copied to the original array w on the host.
When the second target construct (on line 66) is encountered:
  • Since arrays dx and u appear in a map clause with the to map-type, and array w appears in a map clause with the tofrom map-type, storage is allocated for arrays dx, u, and w on the device, and the values of arrays dx, u, and w on the host are copied to the corresponding arrays on the device.
At the end of the second target region:
  • Since array w appears in a map clause with the tofrom map-type, the values of array w on the device are copied to the original array w on the host.
LIBOMPTARGET_DEBUG=1 output shows that both target regions (on lines 42 and 66) have the same data partitioning.
Libomptarget --> Launching target execution __omp_offloading_802_b85fc9__Z4main_l42 with pointer 0x0000000000cf3d28 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000000cf3d28...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
Libomptarget --> Launching target execution __omp_offloading_802_b85fc9__Z4main_l66 with pointer 0x0000000000cf3d30 (index=2).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000000cf3d30...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
The amount of data transferred (for both target regions) can be seen in the LIBOMPTARGET_DEBUG=1 output by grepping for "Libomptarget --> Moving":
$ grep "Libomptarget --> Moving" test_no_target_enter_exit_data.debug Libomptarget --> Moving 2048 bytes (hst:0x00007ffee944a8f0) -> (tgt:0xffffd556aa780000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffee93ca8f0) -> (tgt:0xffffd556aa680000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6c0000) -> (hst:0x00007ffee940a8f0) Libomptarget --> Moving 2048 bytes (hst:0x00007ffee944a8f0) -> (tgt:0xffffd556aa780000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffee93ca8f0) -> (tgt:0xffffd556aa680000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffee940a8f0) -> (tgt:0xffffd556aa6c0000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6c0000) -> (hst:0x00007ffee940a8f0)
You can reduce the copying of data from host to device and vice versa by using the target enter data and target exit data directives, as shown in this modified example.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  /* map data to device. alloc for w avoids map(tofrom: w[0:SIZE]) on target by default. */
  #pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P]) \
      map(alloc: w[0:SIZE])

  /* offload kernel #1 */
  #pragma omp target teams distribute parallel for collapse(4) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }

  /* offload kernel #2 */
  #pragma omp target teams distribute parallel for collapse(4) \
      private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = b + i + j - k;
          double us = b + i + j - k;
          double ut = b + i + j - k;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] += ur * us * ut;
        }
      }
    }
  }

  #pragma omp target exit data map(from: w[0:SIZE])

  end = omp_get_wtime();

  /* print result */
  printf("target region: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
In this example, when the target enter data directive (on line 43) is encountered:
  • Since arrays dx and u appear in a map clause with the to map-type, storage is allocated for arrays dx and u on the device, and the values of arrays dx and u on the host are copied to the corresponding arrays on the device.
  • Since array w appears in a map clause with the alloc map-type, uninitialized storage is allocated for array w on the device.
When the first target construct (on line 47) is encountered:
  • The runtime checks whether storage corresponding to arrays dx, u, and w already exists on the device. Since it does, no data transfer occurs.
At the end of the first target region:
  • The runtime recognizes that the storage for arrays dx, u, and w should remain on the device, and no copy back from the device to the host occurs.
When the second target construct (on line 70) is encountered:
  • Again, no data transfer from the host to the device occurs.
At the end of the second target region:
  • The runtime recognizes that the storage for the arrays dx, u, and w should remain on the device, and no copy back from device to host occurs.
When the target exit data directive (on line 92) is encountered:
  • Since array w appears in a map clause with the from map-type, the values of array w on the device are copied to the original array w on the host.
Using the target enter data and target exit data pair of directives reduced the runtime on the particular ATS GPU used (1-tile only):

No target enter/exit data version : 0.002503 seconds
target enter/exit data version    : 0.001560 seconds
LIBOMPTARGET_DEBUG=1 output shows that the data partitioning is the same in both examples (with and without target enter data and target exit data).
Libomptarget --> Launching target execution __omp_offloading_802_b85fc8__Z4main_l47 with pointer 0x0000000001d7f208 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000001d7f208...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
Libomptarget --> Launching target execution __omp_offloading_802_b85fc8__Z4main_l70 with pointer 0x0000000001d7f210 (index=2).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000001d7f210...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
The improvement in performance when using target enter data and target exit data came from the reduction in data transfers; we now have only the following three data transfers:
$ grep "Libomptarget --> Moving" test_target_enter_exit_data.debug Libomptarget --> Moving 262144 bytes (hst:0x00007fff0188f5d0) -> (tgt:0xffffd556aa690000) Libomptarget --> Moving 2048 bytes (hst:0x00007fff0190f5d0) -> (tgt:0xffffd556aa680000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6d0000) -> (hst:0x00007fff018cf5d0)

Choosing map-type

For improved performance, it is important that the map-type for a mapped variable matches how the variable is used in the target construct.
In the following example, arrays u and dx are only read in the target construct, and array w is only written to. However, the map-types for all of these variables are (inefficiently) specified to be tofrom.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  double ur, us, ut;         /* scalars */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  #pragma omp target teams distribute parallel for \
      private(b, i, j, k, l) \
      map(tofrom: u[0:SIZE], dx[0:P * P]) \
      map(tofrom: w[0:SIZE])
  for (int n = 0; n < SIZE; n++) {
    k = n - (n / P) * P;
    j = (n - k) / P;
    i = (n - (j * P + k)) / (P * P);
    b = n / (P * P * P);

    double ur = 0.;
    double us = 0.;
    double ut = 0.;

    for (l = 0; l < P; l++) {
      ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
      us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
      ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
    }

    w[IDX4(b, i, j, k)] = ur * us * ut;
  }

  end = omp_get_wtime();

  printf("offload: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
For better performance, the map-type for u and dx should be to, and the map-type for w should be from, as shown in the following modified example.
  #pragma omp target teams distribute parallel for \
      private(b, i, j, k, l) \
      map(to: u[0:SIZE], dx[0:P * P]) \
      map(from: w[0:SIZE])
  for (int n = 0; n < SIZE; n++) {
    k = n - (n / P) * P;
    j = (n - k) / P;
    i = (n - (j * P + k)) / (P * P);
    b = n / (P * P * P);

    double ur = 0.;
    double us = 0.;
    double ut = 0.;

    for (l = 0; l < P; l++) {
      ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
      us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
      ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
    }

    w[IDX4(b, i, j, k)] = ur * us * ut;
  }
Using the more specific map-types (to or from, instead of tofrom) reduced the runtime on the particular ATS GPU used (1-tile only):

tofrom map-types version     : 0.002503 seconds
to or from map-types version : 0.001115 seconds
LIBOMPTARGET_DEBUG=1 output shows that there are unnecessary data transfers between the host and the device when the tofrom map-type is used for u, dx, and w. With tofrom, there are six transfers to copy the values of u, dx, and w from the host to the device and vice versa:
$ grep "Libomptarget --> Moving" test_map_tofrom.debug Libomptarget --> Moving 2048 bytes (hst:0x00007ffe24481360) -> (tgt:0xffffd556aa760000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffe24401360) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffe24441360) -> (tgt:0xffffd556aa6a0000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6a0000) -> (hst:0x00007ffe24441360) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa660000) -> (hst:0x00007ffe24401360) Libomptarget --> Moving 2048 bytes (tgt:0xffffd556aa760000) -> (hst:0x00007ffe24481360)
With the more specific map-types (to or from), we see only three data transfers: two transfers to copy the values of u and dx from host to device, and one transfer to copy the values of w from device to host:
$ grep "Libomptarget --> Moving" test_map_to_or_from.debug Libomptarget --> Moving 2048 bytes (hst:0x00007ffe4e66b100) -> (tgt:0xffffd556aa760000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffe4e5eb100) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6a0000) -> (hst:0x00

Do Not Map Read-Only Scalar Variables

The compiler will produce more efficient code if read-only scalar variables in a target construct are not mapped, but are instead listed in a firstprivate clause on the target construct or not listed in any clause at all. (Note that when a scalar variable is not listed in any clause on the target construct, it will be firstprivate by default.)
Listing a read-only scalar variable in a map(to: ) clause causes unnecessary memory allocation on the device and copying of data from the host to the device. On the other hand, when a read-only scalar is specified to be firstprivate on the target construct, the variable is passed as an argument when launching the kernel, and no memory allocation or copying for the variable is required.
In the following example, a loop nest is offloaded onto the GPU. In the target construct, the three scalar variables s1, s2, and s3 are read-only and are listed in a map(to: ) clause.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  double s1, s2, s3;         /* scalars */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  /* initialize scalars */
  s1 = u[SIZE / 2];
  s2 = scaled_rand();
  s3 = 0.145;

  /* map data to device */
  #pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P])

  start = omp_get_wtime();

  /* offload the kernel with collapse clause */
  #pragma omp target teams distribute parallel for collapse(4) \
      map(to: s1, s2, s3) private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)] + s1;
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)] - s2;
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)] * s3;
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }

  end = omp_get_wtime();

  #pragma omp target exit data map(from: w[0:SIZE])

  /* print result */
  printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
It is more efficient to list s1, s2, and s3 in a firstprivate clause on the target construct, as shown in the modified example below, or not to list them in any clause at all.
  #pragma omp target teams distribute parallel for collapse(4) \
      firstprivate(s1, s2, s3) private(b, i, j, k, l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          double ur = 0.;
          double us = 0.;
          double ut = 0.;

          for (l = 0; l < P; l++) {
            ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)] + s1;
            us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)] - s2;
            ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)] * s3;
          }

          w[IDX4(b, i, j, k)] = ur * us * ut;
        }
      }
    }
  }
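Equivalently, since scalars that do not appear in any clause on the target construct are firstprivate by default, the clause could simply be omitted. A minimal sketch of just the directive line (the loop nest is unchanged):

  /* s1, s2, and s3 are not listed in any clause, so they are implicitly firstprivate. */
  #pragma omp target teams distribute parallel for collapse(4) private(b, i, j, k, l)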
Using firstprivate(s1, s2, s3) instead of map(to: s1, s2, s3) reduced the runtime on the particular ATS GPU used (1-tile only):

map(to:s1,s2,s3) version       : 0.001324 seconds
firstprivate(s1,s2,s3) version : 0.000730 seconds
LIBOMPTARGET_DEBUG=1 output shows that the data partitioning is the same in both examples (with map(to: s1, s2, s3) and with firstprivate(s1, s2, s3)).
Libomptarget --> Launching target execution __omp_offloading_802_b85fc0__Z4main_l51 with pointer 0x00000000024dbc98 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x00000000024dbc98...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
Libomptarget --> Launching target execution __omp_offloading_802_b85fc4__Z4main_l51 with pointer 0x0000000002289c98 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000002289c98...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0xffffd556aa660000) was set successfully.
However, more device memory allocations and host-to-device data transfers occur when the map(to: s1, s2, s3) clause is used. LIBOMPTARGET_DEBUG=1 output shows the following memory allocations on the device in that case.
Target LEVEL0 RTL --> Memory usage for device memory, device 0:
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 1179648, 526360
Target LEVEL0 RTL --> -- Allocated: 1179648, 526432
Target LEVEL0 RTL --> -- Freed    : 1179648, 262240
Target LEVEL0 RTL --> -- InUse    : 0, 264192
Target LEVEL0 RTL --> -- PeakUse  : 1179648, 526432
Target LEVEL0 RTL --> -- NumAllocs: 3, 6
Note that the memory allocated is 1,179,648 bytes, and the number of allocations (from the pool) is 6: three for the arrays (dx, u, and w) and three for the scalars (s1, s2, and s3).
In contrast, LIBOMPTARGET_DEBUG=1 output shows fewer memory allocations on the device when the firstprivate(s1, s2, s3) clause is used. The memory allocated is reduced from 1,179,648 to 1,114,112 bytes (a reduction of 64 kilobytes), and the number of allocations (from the pool) is reduced from 6 to 3, as shown below.
Target LEVEL0 RTL --> Memory usage for device memory, device 0:
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 1114112, 526336
Target LEVEL0 RTL --> -- Allocated: 1114112, 526336
Target LEVEL0 RTL --> -- Freed    : 1114112, 262144
Target LEVEL0 RTL --> -- InUse    : 0, 264192
Target LEVEL0 RTL --> -- PeakUse  : 1114112, 526336
Target LEVEL0 RTL --> -- NumAllocs: 2, 3
In addition to the extra memory allocations, using the map(to: ) clause results in more data transfers from host to device. This can be seen by grepping for "Libomptarget --> Moving" in the LIBOMPTARGET_DEBUG=1 output:
$ grep "Libomptarget --> Moving" test_scalars_map.debug Libomptarget --> Moving 262144 bytes (hst:0x00007fff85a9b000) -> (tgt:0xffffd556aa670000) Libomptarget --> Moving 2048 bytes (hst:0x00007fff85b1b000) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 8 bytes (hst:0x00007fff85b1b950) -> (tgt:0xffffd556aa650000) Libomptarget --> Moving 8 bytes (hst:0x00007fff85b1b958) -> (tgt:0xffffd556aa650020) Libomptarget --> Moving 8 bytes (hst:0x00007fff85b1b960) -> (tgt:0xffffd556aa650040) Libomptarget --> Moving 262144 bytes (hst:0x00007fff85adb000) -> (tgt:0xffffd556aa6b0000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6b0000) -> (hst:0x00007fff85adb000)
In contrast, when the firstprivate(s1, s2, s3) clause is used, LIBOMPTARGET_DEBUG=1 output shows:
$ grep "Libomptarget --> Moving" test_scalars_fp.debug Libomptarget --> Moving 262144 bytes (hst:0x00007ffd9d207140) -> (tgt:0xffffd556aa670000) Libomptarget --> Moving 2048 bytes (hst:0x00007ffd9d287140) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffd9d247140) -> (tgt:0xffffd556aa6b0000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa6b0000) -> (hst:0x00007ffd9d247140)
Note that in the example with map(to: s1, s2, s3) we have three additional data transfers, each moving 8 bytes. These transfers copy the values of s1, s2, and s3 from host to device.

Do Not Map Loop Bounds to Get Better ND-Range Partitioning

As mentioned above, the compiler will produce more efficient code if read-only scalar variables in a target construct are not mapped, but are listed in a firstprivate clause on the target construct or not listed in any clause at all.
This is especially true when the scalars in question are parallel loop bounds in the target construct. If any of the loop bounds (lower bound, upper bound, or step) are mapped, this will result in unnecessary memory allocation on the device and copying of data from host to device. Loop partitioning will also be affected, and may result in non-optimal ND-range partitioning that negatively impacts performance.
Consider the following example, where a parallel for loop is offloaded onto the GPU. The upper bound of the for loop is the scalar variable upper, which is mapped by the target construct (on line 49).
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  double ur, us, ut;         /* scalars */
  int b, i, j, k, l;         /* loop counters */
  int upper;
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  upper = (int)dx[0] + SIZE;

  /* map data to device */
  #pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P])

  start = omp_get_wtime();

  /* offload kernel */
  #pragma omp target teams distribute parallel for private(b, i, j, k, l) \
      map(to: upper)
  for (int n = 0; n < upper; n++) {
    double ur = 0.;
    double us = 0.;
    double ut = 0.;

    k = n - (n / P) * P;
    j = (n - k) / P;
    i = (n - (j * P + k)) / (P * P);
    b = n / (P * P * P);

    for (l = 0; l < P; l++) {
      ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
      us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
      ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
    }

    w[IDX4(b, i, j, k)] = ur * us * ut;
  }

  end = omp_get_wtime();

  /* map data from device */
  #pragma omp target exit data map(from: w[0:SIZE])

  printf("offload: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
Since upper is mapped, the value of the variable upper on the host may be different from the value on the device. Because of this, when the target region is offloaded at runtime, the number of loop iterations in the offloaded loop is not known on the host. In this case, the runtime (libomptarget.so) will use device and kernel properties to choose an ND-range partitioning that fills the whole GPU.
The compiler-generated code for the offloaded loop includes an additional innermost loop (per work-item) inside the offloaded loop. If the global size selected happens to be smaller than the actual number of loop iterations, each work-item will process multiple iterations of the original loop. If the global size selected is larger than the actual number of loop iterations, some of the work-items will not do any work. An if-condition inside the loop generated by the compiler will check this and skip the rest of the loop body.
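Conceptually, the structure resembles the following sketch. This is only an illustration, not the actual compiler-generated code; the names global_id, global_size, and trip_count are hypothetical, and the guard the compiler inserts is shown here folded into the loop condition.

/* Illustrative sketch only -- not actual compiler output.  Each work-item
 * strides through the iteration space of the original loop.  If the launched
 * global size is smaller than trip_count, a work-item executes several
 * iterations; if it is larger, the extra work-items never enter the loop. */
static void work_item_sketch(int global_id, int global_size, int trip_count) {
  for (int n = global_id; n < trip_count; n += global_size) {
    /* ... body of the original offloaded loop for iteration n ... */
  }
}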
For the above example (where upper is mapped), LIBOMPTARGET_DEBUG=1 output shows the following ND-range partitioning:
Libomptarget --> Launching target execution __omp_offloading_802_b85fdf__Z4main_l49 with pointer 0x0000000001e66da8 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000001e66da8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Group sizes = {1024, 1, 1}
Target LEVEL0 RTL --> Group counts = {240, 1, 1}
Note that in the above partitioning, the total number of work-items is 240 x 1024 = 245,760, which is larger than the actual number of loop iterations (32,768). So some of the work-items will not do any work.
Better ND-range partitioning is achieved if the number of loop iterations in the offloaded loop is known on the host. This allows the compiler and runtime to do an ND-range partitioning that matches the number of loop iterations.
To get this better partitioning, we use firstprivate(upper) instead of map(to: upper) on the target construct, as shown in the modified example below. This way, the compiler knows that the value of the variable upper on the host is the same as the value of the variable upper on the device.
  #pragma omp target teams distribute parallel for private(b, i, j, k, l) \
      firstprivate(upper)
  for (int n = 0; n < upper; n++) {
    double ur = 0.;
    double us = 0.;
    double ut = 0.;

    k = n - (n / P) * P;
    j = (n - k) / P;
    i = (n - (j * P + k)) / (P * P);
    b = n / (P * P * P);

    for (l = 0; l < P; l++) {
      ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
      us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
      ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
    }

    w[IDX4(b, i, j, k)] = ur * us * ut;
  }
For the modified example (where upper is firstprivate), LIBOMPTARGET_DEBUG=1 output shows the following ND-range partitioning:
Libomptarget --> Launching target execution __omp_offloading_802_b85fe0__Z4main_l49 with pointer 0x00000000016d97d8 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x00000000016d97d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred group size is multiple of 32
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {32, 1, 1}
Target LEVEL0 RTL --> Group counts = {1024, 1, 1}
Note that in the above partitioning, the total number of work-items is 1024 x 32 = 32,768, which exactly matches the actual number of loop iterations.
Using firstprivate(upper) instead of map(to: upper) reduced the runtime on the particular ATS GPU used (1-tile only):

map(to:upper) version       : 0.002432 seconds
firstprivate(upper) version : 0.001442 seconds

Allocating Memory on the Device

As mentioned above, the map clause determines how an original host variable is mapped to a corresponding variable on the device. However, the map(to: ) clause may not be the most efficient way to allocate memory for a variable on the device.
In the following example, the variables ur, us, and ut are used as work (temporary) arrays in the computations on the device. The arrays are mapped to the device using map(to: ) clauses (lines 46-48).
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];                       /* output */
  double u[SIZE], dx[P * P];            /* input */
  double ur[SIZE], us[SIZE], ut[SIZE];  /* work arrays */
  int b, i, j, k, l;                    /* loop counters */
  double start, end;                    /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  /* offload the kernel */
  #pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
      map(to:u[0:SIZE],dx[0:P*P]) \
      map(from:w[0:SIZE]) \
      map(to:ur[0:SIZE]) \
      map(to:us[0:SIZE]) \
      map(to:ut[0:SIZE]) \
      private(b,i,j,k,l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          w[IDX4(b, i, j, k)] = 0.;
          ur[IDX4(b, i, j, k)] = 0.;
          us[IDX4(b, i, j, k)] = 0.;
          ut[IDX4(b, i, j, k)] = 0.;

          for (l = 0; l < P; l++) {
            ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] * ut[IDX4(b, i, j, k)];
        }
      }
    }
  }

  end = omp_get_wtime();

  /* print result */
  printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
The amount of data transferred between host and device can be seen in the LIBOMPTARGET_DEBUG=1 output by grepping for "Libomptarget --> Moving". The output shows that the map(to: ) clauses for the arrays ur, us, and ut cause the transfer of 262,144 bytes from host to device for each of the arrays:
$ grep "Libomptarget --> Moving" test_map_to.debug Libomptarget --> Moving 262144 bytes (hst:0x00007ffefbe20e50) -> (tgt:0xffffd556aa6b0000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffefbe60e50) -> (tgt:0xffffd556aa6f0000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffefbea0e50) -> (tgt:0xffffd556aa730000) Libomptarget --> Moving 2048 bytes (hst:0x00007ffefbf60e50) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffefbee0e50) -> (tgt:0xffffd556aa560000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa670000) -> (hst:0x00007ffefbf20e50)
These data transfers are wasteful, because the arrays ur, us, and ut are simply used as temporary work arrays on the device. A better approach is to place the declarations of the arrays between the declare target and end declare target directives. This indicates that the arrays are mapped to the device data environment, but no data transfers for these arrays occur unless a target update directive is used to manage the consistency of the arrays between host and device. This approach is illustrated in the following modified example.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

#pragma omp declare target
double ur[SIZE], us[SIZE], ut[SIZE];  /* work arrays */
#pragma omp end declare target

int main(int argc, char *argv[]) {
  double w[SIZE];            /* output */
  double u[SIZE], dx[P * P]; /* input */
  int b, i, j, k, l;         /* loop counters */
  double start, end;         /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  /* offload the kernel */
  #pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
      map(to:u[0:SIZE],dx[0:P*P]) \
      map(from:w[0:SIZE]) \
      private(b,i,j,k,l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          w[IDX4(b, i, j, k)] = 0.;
          ur[IDX4(b, i, j, k)] = 0.;
          us[IDX4(b, i, j, k)] = 0.;
          ut[IDX4(b, i, j, k)] = 0.;

          for (l = 0; l < P; l++) {
            ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] * ut[IDX4(b, i, j, k)];
        }
      }
    }
  }

  end = omp_get_wtime();

  /* print result */
  printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
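If the host ever did need to read or initialize these device-resident arrays, a target update directive could be used to transfer them explicitly. A minimal, hypothetical sketch (not part of the example above):

  /* Hypothetical sketch: explicit, one-off transfers for the declare target array ur.
   * Only needed if the host actually has to set or see the device copy. */
  #pragma omp target update to(ur[0:SIZE])    /* host -> device */
  /* ... offloaded kernels that use ur ... */
  #pragma omp target update from(ur[0:SIZE])  /* device -> host */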
In the above modified example, memory is allocated for arrays ur, us, and ut on the device, but no data transfers for these arrays take place. This can be seen by grepping for "Libomptarget --> Moving" in the LIBOMPTARGET_DEBUG=1 output. We no longer see the transfer of 262,144 bytes from host to device for each of the arrays:
$ grep "Libomptarget --> Moving" test_declare_target.debug Libomptarget --> Moving 2048 bytes (hst:0x00007fff12eaefe0) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (hst:0x00007fff12e2efe0) -> (tgt:0xffffd556aa6b0000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa670000) -> (hst:0x00007fff12e6e
An alternative approach for allocating memory on the device, without transferring any data between host and device, uses the map(alloc: ) clause instead of the map(to: ) clause, as shown below (lines 46-48).
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(int argc, char *argv[]) {
  double w[SIZE];                       /* output */
  double u[SIZE], dx[P * P];            /* input */
  double ur[SIZE], us[SIZE], ut[SIZE];  /* work arrays */
  int b, i, j, k, l;                    /* loop counters */
  double start, end;                    /* timers */

  omp_set_default_device(0);

  /* dummy target region, so as not to measure startup time. */
  #pragma omp target
  { ; }

  /* initialize input with random values */
  srand(0);
  for (int i = 0; i < SIZE; i++)
    u[i] = scaled_rand();

  for (int i = 0; i < P * P; i++)
    dx[i] = scaled_rand();

  start = omp_get_wtime();

  /* offload the kernel */
  #pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
      map(to:u[0:SIZE],dx[0:P*P]) \
      map(from:w[0:SIZE]) \
      map(alloc:ur[0:SIZE]) \
      map(alloc:us[0:SIZE]) \
      map(alloc:ut[0:SIZE]) \
      private(b,i,j,k,l)
  for (b = 0; b < BLOCKS; b++) {
    for (i = 0; i < P; i++) {
      for (j = 0; j < P; j++) {
        for (k = 0; k < P; k++) {
          w[IDX4(b, i, j, k)] = 0.;
          ur[IDX4(b, i, j, k)] = 0.;
          us[IDX4(b, i, j, k)] = 0.;
          ut[IDX4(b, i, j, k)] = 0.;

          for (l = 0; l < P; l++) {
            ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
            us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
            ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
          }

          w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] * ut[IDX4(b, i, j, k)];
        }
      }
    }
  }

  end = omp_get_wtime();

  /* print result */
  printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

  return 0;
}
In the above example, the map(alloc: ) clauses for arrays ur, us, and ut cause memory to be allocated for ur, us, and ut on the device, and no data transfers occur, as in the declare target and end declare target case:
$ grep "Libomptarget --> Moving" test_map_alloc.debug Libomptarget --> Moving 2048 bytes (hst:0x00007ffc6492ba80) -> (tgt:0xffffd556aa660000) Libomptarget --> Moving 262144 bytes (hst:0x00007ffc648aba80) -> (tgt:0xffffd556aa560000) Libomptarget --> Moving 262144 bytes (tgt:0xffffd556aa670000) -> (hst:0x00007ffc648eba8
The performance of the various versions when running on the particular ATS GPU used (1-tile only) was as follows:

map(to: ) version                           : 0.002039 seconds
declare target / end declare target version : 0.001272 seconds
map(alloc: ) version                        : 0.001390 seconds

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.