Contents

# Minimizing Data Transfers and Memory Allocations

When offloading computations onto the GPU, it is important to minimize data transfers between the host and the device, and reduce memory allocations on the device. There are various ways to achieve this, as described below.

## Use target enter data and target exit data Directives

When variables are used by multiple
target
constructs, the
target enter data
and
target exit data
pair of directives can be used to minimize data transfers between host and device.
Place the
target enter data
directive before the first
target
construct to transfer data from host to device, and place the
target exit data
directive after the last
target
construct to transfer data from device to host.
Consider the following example where we have two
target
constructs (on lines 47 and 71), and each
target
dx
and
u
and and writes to array
w
.
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
int b, i, j, k, l;         /* loop counters */
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

#pragma omp target teams distribute parallel for collapse(4) \
map(to: u[0:SIZE], dx[0:P * P]) map(from: w[0:SIZE]) \
private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
}
}
}

#pragma omp target teams distribute parallel for collapse(4) \
map(to: u[0:SIZE], dx[0:P * P]) map(tofrom: w[0:SIZE]) \
private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = b + i + j - k;
double us = b + i + j - k;
double ut = b + i + j - k;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] += ur * us * ut;
}
}
}
}

end = omp_get_wtime();

/* print result */
printf("target region: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
Compilation command:
``icx-cl -fiopenmp -fopenmp-targets=spir64 test_no_target_enter_exit_data.cpp``
Run command:
``OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out``
When the first
target
construct (on line 47) is encountered:
• Since arrays
dx
and
u
appear in a
map
clause with the
to
map-type, storage is allocated for the arrays on the device, and the values of
dx
and
u
on the host are copied to the corresponding arrays on the device.
• Since array
w
appears in a
map
clause with the
from
map-type, uninitialized storage is allocated for array
w
on the device.
At the end of the first
target
region:
• Since array
w
appears in a
map
clause with the
from
map-type, the values of array
w
on the device are copied to the original array
w
on the host.
When the second
target
construct (on line 71) is encountered:
• Since arrays
dx
,
u
, and
w
appear in a
map
clause with the
to
map-type, storage is allocated for arrays
dx
,
u
, and
w
on the device and the values of arrays
dx
,
u
, and
w
on the host are copied to the corresponding arrays on the device.
At the end of the second
target
region:
• Since array
w
appears in a
map
clause with the
from
map-type, the values of array
w
on the device are copied to the original array
w
on the host.
LIBOMPTARGET_DEBUG=1 output shows that both
target
regions (on lines 47 and 71) have the same data partitioning.
``````Libomptarget --> Launching target execution __omp_offloading_3d_15ece5c8__Z4main_l42 with pointer 0x00000000024cb5d8 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x00000000024cb5d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
``````Target LEVEL0 RTL --> Executing a kernel 0x0000000002b9c5e0...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0xff00ffffffee0000) was set successfully for device 0.
``````
The amount of data transferred (for both
target
regions) can be seen in LIBOMPTARGET_DEBUG=1 output by grepping for
"Libomptarget --> Moving"
:
``````\$ grep "Libomptarget --> Moving" test_no_target_enter_exit_data.debug
Libomptarget --> Moving 2048 bytes (hst:0x00007fff60f05030) -> (tgt:0xff00fffffffe0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fff60ec5030) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff20000) -> (hst:0x00007fff60e85030)
Libomptarget --> Moving 2048 bytes (hst:0x00007fff60f05030) -> (tgt:0xff00fffffffe0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fff60ec5030) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fff60e85030) -> (tgt:0xff00fffffff20000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff20000) -> (hst:0x00007fff60e85030)``````
You can reduce the copying of data from host to device and vice versa by using the
target enter data
and
target exit data
directives as shown in this modified example.
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
int b, i, j, k, l;         /* loop counters */
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

/* map data to device. alloc for w avoids map(tofrom: w[0:SIZE])
on target by default. */
#pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P]) \
map(alloc: w[0:SIZE])

#pragma omp target teams distribute parallel for collapse(4) \
private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
}
}
}

#pragma omp target teams distribute parallel for collapse(4) \
private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = b + i + j - k;
double us = b + i + j - k;
double ut = b + i + j - k;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] += ur * us * ut;
}
}
}
}

#pragma omp target exit data map(from: w[0:SIZE])

end = omp_get_wtime();

/* print result */
printf("target region: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
In the modified example, when the
target enter data
directive (on line 48) is encountered:
• Since arrays
dx
and
u
appear in a
map
clause with the
to
map-type, storage is allocated for arrays
dx
and
u
on the device, and the values of arrays
dx
and
u
on the host are copied to the corresponding arrays on the device.
• Since array
w
appears in a
map
clause with the
alloc
map-type, uninitialized storage is allocated for array
w
on the device.
When the first
target
construct (on line 52) is encountered:
• The runtime checks whether storage corresponding to arrays
dx
,
u
, and
w
already exists on the device. Since it does, no data transfer occurs.
At the end of the first
target
region:
• The runtime will recognize that the storage for arrays
dx
,
u
, and
w
should remain on the device, and no copy back from the device to the host occurs.
When the second
target
construct (on line 75) is encountered:
• Again no data transfer from the host to the device occurs.
At the end of the second
target
region:
• The runtime will recognize that the storage for the arrays
dx
,
u
, and
w
should remain on the device, and no copy back from device to host will occur.
When the
target exit data
directive (on line 97) is encountered:
• Since array
w
appears in a
map
clause with the
from
map-type, the values of array
w
on the device are copied to the original array
w
on the host.
Using the
target enter data
and
target exit data
pair of directives reduced the runtime on the particular GPU used (1-tile only):
``````No target enter/exit data version : 0.001204 seconds
target enter/exit data version    : 0.000934 seconds``````
LIBOMPTARGET_DEBUG=1 output shows that data partitioning is the same in both examples (with and without
target enter data
and
target exit data
).
``````Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd899939c0, Size=2048)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd899939c0, TgtPtrBegin=0xff00ffffffee0000, Size=2048, DynRefCount=2 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xff00ffffffee0000, Offset: 0) from host pointer 0x00007ffd899939c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd899539c0, Size=262144)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd899539c0, TgtPtrBegin=0xff00ffffffef0000, Size=262144, DynRefCount=2 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xff00ffffffef0000, Offset: 0) from host pointer 0x00007ffd899539c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd899139c0, Size=262144)...
``````
``````Libomptarget --> Launching target execution __omp_offloading_3d_fadb4d__Z4main_l47 with pointer 0x0000000002b9c5d8 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x0000000002b9c5d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
The improvement in performance when using
target enter data
and
target exit data
came from the reduction of data transfers, where we now have the following three data transfers:
``````\$ grep "Libomptarget --> Moving" test_target_enter_exit_data.debug
Libomptarget --> Moving 262144 bytes (hst:0x00007ffd899539c0) -> (tgt:0xff00ffffffef0000)
Libomptarget --> Moving 2048 bytes (hst:0x00007ffd899939c0) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff30000) -> (hst:0x00007ffd899139c0)``````

## Choose map-type Appropriately

For improved performance, it is important that the map-type for a mapped variable matches how the variable is used in the
target
construct.
In the following example, arrays
u
and
dx
target
construct, and array
w
is written to in the
target
construct. However, the map-types for all these variables is (inefficiently) specified to be
tofrom
.
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
int b, i, j, k, l;         /* loop counters */
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

#pragma omp target teams distribute parallel for \
private(b, i, j, k, l) \
map(tofrom: u[0:SIZE], dx[0:P * P]) \
map(tofrom: w [0:SIZE])
for (int n = 0; n < SIZE; n++) {
k = n - (n / P) * P;
j = (n - k) / P;
i = (n - (j * P + k)) / (P * P);
b = n / (P * P * P);

double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}

end = omp_get_wtime();

printf("offload: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
Compilation command:
``icpx -fiopenmp -fopenmp-targets=spir64 test_map_tofrom.cpp``
Run command:
``OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out``
For better performance, the map-type for
u
and
dx
should be
to
, and the map-type for
w
should be
from
, as shown in the following modified example.
``````  #pragma omp target teams distribute parallel for \
private(b, i, j, k, l) \
map(to: u[0:SIZE], dx[0:P * P]) \
map(from: w [0:SIZE])
for (int n = 0; n < SIZE; n++) {
k = n - (n / P) * P;
j = (n - k) / P;
i = (n - (j * P + k)) / (P * P);
b = n / (P * P * P);

double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
``````
Using more specific map-types (
to
or
from
tofrom
), reduced the runtime on the particular GPU used (1-tile only):
``````tofrom map-types version     : 0.001141 seconds
to or from map-types version : 0.000908  seconds``````
LIBOMPTARGET_DEBUG=1 output shows that there are unnecessary data transfers between the host and the device when the
tofrom
map-type is used for
u
,
dx
, and
w
. With
tofrom
, there are six transfers to copy the values of
u
,
dx
, and
w
from the host to the device and vice-versa:
``````\$ grep "Libomptarget --> Moving" test_map_tofrom.debug
Libomptarget --> Moving 2048 bytes (hst:0x00007fff1f6ad540) -> (tgt:0xff00fffffffe0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fff1f66d540) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fff1f62d540) -> (tgt:0xff00fffffff20000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff20000) -> (hst:0x00007fff1f62d540)
Libomptarget --> Moving 262144 bytes (tgt:0xff00ffffffee0000) -> (hst:0x00007fff1f66d540)
Libomptarget --> Moving 2048 bytes (tgt:0xff00fffffffe0000) -> (hst:0x00007fff1f6ad540)``````
With the more specific map-types (
to
or
from
), we see only three data transfers: two transfers to copy the values of
u
and
dx
from host to device, and one transfer to copy the values of
w
from device to host:
``````\$ grep "Libomptarget --> Moving" test_map_to_or_from.debug
Libomptarget --> Moving 2048 bytes (hst:0x00007fffc2258fd0) -> (tgt:0xff00fffffffe0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fffc2218fd0) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff20000) -> (hst:0x00007fffc21d8fd0)``````

## Do Not Map Read-Only Scalar Variables

The compiler will produce more efficient code if read-only scalar variables in a
target
construct are not mapped, but are listed in a
firstprivate
clause on the
target
construct or not listed in any clause at all. (Note that when a scalar variable is not listed in any clause on the
target
construct, it will be
firstprivate
by default.)
Listing a read-only scalar variable on a
map(to: )
clause causes unnecessary memory allocation on the device and copying of data from the host to the device. On the other hand, when a read-only scalar is specified to be
firstprivate
on the
target
construct, the variable is passed as argument when launching the kernel, and no memory allocation or copying for the variable is required.
In the following example, a loop nest is offloaded onto the GPU. In the
target
construct, the three scalar variables,
s1
,
s2
, and
s3
, are read-only and are listed in a
map(to: )
clause.
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
double s1, s2, s3;         /* scalars */
int b, i, j, k, l;         /* loop counters */
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

/* initialize scalars */
s1 = u[SIZE / 2];
s2 = scaled_rand();
s3 = 0.145;

/* map data to device */
#pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P])

start = omp_get_wtime();

/* offload the kernel with collapse clause */
#pragma omp target teams distribute parallel for collapse(4) \
map(to: s1, s2, s3) private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)] + s1;
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)] - s2;
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)] * s3;
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
}
}
}

end = omp_get_wtime();

#pragma omp target exit data map(from: w[0:SIZE])

/* print result */
printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
Compilation command:
``icpx -fiopenmp -fopenmp-targets=spir64 test_scalars_map.cpp``
Run command:
``OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out``
It is more efficient to list
s1
,
s2
, and
s3
in a
firstprivate
clause on the
target
construct, as shown in the modified example below, or not list them in any clause at all.
``````  /* offload the kernel with collapse clause */
#pragma omp target teams distribute parallel for collapse(4) \
firstprivate(s1, s2, s3) private(b, i, j, k, l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)] + s1;
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)] - s2;
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)] * s3;
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
}
}
}
``````
Using
firstprivate(s1, s2, s3)
map(to:s1, s2, s3)
, reduced the runtime on the particular GPU used (1-tile only):
``````map(to:s1,s2,s3) version       : 0.001324 seconds
firstprivate(s1,s2,s3) version : 0.000730 seconds``````
LIBOMPTARGET_DEBUG=1 output shows that data partitioning is the same in both examples (with
map(to:s1, s2, s3)
and with
firstprivate(to:s1, s2, s3)
.
``````Libomptarget --> Launching target execution __omp_offloading_3d_9b49d7d8__Z4main_l51 with pointer 0x0000000002b295d8 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x0000000002b295d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
``````Libomptarget --> Launching target execution __omp_offloading_3d_9b49d7dd__Z4main_l51 with pointer 0x0000000001f475d8 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x0000000001f475d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
However, more device memory allocations and host-to-device data transfers occur when the
map(to:s1, s2, s3)
clause is used.
LIBOMPTARGET_DEBUG=1 output shows the following data about memory allocations on the device when
map(to:s1, s2, s3)
clause is used.
``````Target LEVEL0 RTL --> Memory usage for device memory, device 0x000000000278e470
Target LEVEL0 RTL --> -- Allocator:       Native,         Pool
Target LEVEL0 RTL --> -- Requested:      1179648,       526360
Target LEVEL0 RTL --> -- Allocated:      1179648,       526528
Target LEVEL0 RTL --> -- Freed    :      1179648,       262336
Target LEVEL0 RTL --> -- InUse    :            0,       264192
Target LEVEL0 RTL --> -- PeakUse  :      1179648,       526528
Target LEVEL0 RTL --> -- NumAllocs:            3,            6
``````
Note that the memory allocated is 1,179,648 bytes, and the number of allocations (from the pool) is 6 – for the three arrays (
dx
,
u
, and
w
) and the three scalars (
s1
,
s2
, and
s3
).
In contrast, LIBOMPTARGET_DEBUG=1 output shows fewer memory allocations on the device when the
firstprivate(s1, s2, s3)
clause is used. The memory allocated is reduced from 1,179,648 to 1,114,112 bytes (a reduction of 64 kilobytes), and the number of allocations (from the pool) is reduced from 6 to 3, as shown below.
``````Target LEVEL0 RTL --> Memory usage for device memory, device 0x0000000001bab440
Target LEVEL0 RTL --> -- Allocator:       Native,         Pool
Target LEVEL0 RTL --> -- Requested:      1114112,       526336
Target LEVEL0 RTL --> -- Allocated:      1114112,       526336
Target LEVEL0 RTL --> -- Freed    :      1114112,       262144
Target LEVEL0 RTL --> -- InUse    :            0,       264192
Target LEVEL0 RTL --> -- PeakUse  :      1114112,       526336
Target LEVEL0 RTL --> -- NumAllocs:            2,            3
``````
In addition to more memory allocations, using the
map(to: )
clause results in are more data transfers from host to device. This can be seen by grepping for
"Libomptarget --> Moving"
in the LIBOMPTARGET_DEBUG=1 output:
``````\$ grep "Libomptarget --> Moving" test_scalars_map.debug
Libomptarget --> Moving 262144 bytes (hst:0x00007ffdf5526760) -> (tgt:0xff00ffffffef0000)
Libomptarget --> Moving 2048 bytes (hst:0x00007ffdf5566760) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 8 bytes (hst:0x00007ffdf55670a0) -> (tgt:0xff00ffffffed0000)
Libomptarget --> Moving 8 bytes (hst:0x00007ffdf55670a8) -> (tgt:0xff00ffffffed0040)
Libomptarget --> Moving 8 bytes (hst:0x00007ffdf55670b0) -> (tgt:0xff00ffffffed0080)
Libomptarget --> Moving 262144 bytes (hst:0x00007ffdf54e6760) -> (tgt:0xff00fffffff30000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff30000) -> (hst:0x00007ffdf54e6760)``````
In contrast, when the
firstprivate(to:s1, s2, s3)
clause is used, LIBOMPTARGET_DEBUG=1 output shows:
``````\$ grep "Libomptarget --> Moving" test_scalars_fp.debug
Libomptarget --> Moving 262144 bytes (hst:0x00007ffda809c4a0) -> (tgt:0xff00ffffffef0000)
Libomptarget --> Moving 2048 bytes (hst:0x00007ffda80dc4a0) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007ffda805c4a0) -> (tgt:0xff00fffffff30000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00fffffff30000) -> (hst:0x00007ffda805c4a0)``````
Note that in the example with
map(to:s1, s2, s3)
we have three additional data transfers, each moving 8 bytes. These transfers are for copying the values of
s1
,
s2
, and
s3
from host to device.

## Do Not Map Loop Bounds to Get Better ND-Range Partitioning

As mentioned above, the compiler will produce more efficient code if read-only scalar variables in a
target
construct are not mapped, but are listed in a
firstprivate
clause on the
target
construct or not listed in any clause at all.
This is especially true when the scalars in question are parallel loop bounds in the
target
construct. If any of the loop bounds (lower bound, upper bound, or step) are mapped, then this will result in unnecessary memory allocation on the device and copying of data from host to device. Loop partitioning will also be affected, and may result in non-optimal ND-range partitioning that negatively impacts performance.
Consider the following example, where a
parallel for
loop is offloaded onto the GPU. The upper bound of the for loop is the scalar variable
upper
, which is mapped by the
target
construct (on line 53).
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
int b, i, j, k, l;         /* loop counters */
int upper;
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

upper = (int)dx[0] + SIZE;

/* map data to device */
#pragma omp target enter data map(to: u[0:SIZE], dx[0:P * P])

start = omp_get_wtime();

#pragma omp target teams distribute parallel for private(b, i, j, k, l) \
map(to: upper)
for (int n = 0; n < upper; n++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

k = n - (n / P) * P;
j = (n - k) / P;
i = (n - (j * P + k)) / (P * P);
b = n / (P * P * P);

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}

end = omp_get_wtime();

/* map data from device */
#pragma omp target exit data map(from: w[0:SIZE])

printf("offload: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
Compilation command:
``icpx -fiopenmp -fopenmp-targets=spir64 test_loop_bounds_map.cpp``
Run command:
``OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out``
Since
upper
is mapped, the value of the variable
upper
on the host may be different from the value on the device. Because of this, when the target region is offloaded at runtime, the number of loop iterations in the offloaded loop is not known on the host. In this case, the runtime (libomptarget.so) will use device and kernel properties to choose ND-range partitioning that fills the whole GPU.
The compiler-generated code for the offloaded loop includes an additional innermost loop (per work-item) inside the offloaded loop. If the global size selected happens to be smaller than the actual number of loop iterations, each work-item will process multiple iterations of the original loop. If the global size selected is larger than the actual number of loop iterations, some of the work-items will not do any work. An if-condition inside the loop generated by the compiler will check this and skip the rest of the loop body.
For the above example (where
upper
is mapped), LIBOMPTARGET_DEBUG=1 shows the following ND-range partitioning.
``````Libomptarget --> Launching target execution __omp_offloading_3d_1ff4bf1c__Z4main_l48 with pointer 0x00000000021175d8 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x00000000021175d8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Group sizes = {1024, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
Note that in the above partitioning, the total number of work-items = 512 x 1024 = 524,288, which is larger than the actual number of loop iterations (32,767). So some of the work-items will not do any work.
Better ND-range partitioning is achieved if the number of loop iterations in the offloaded loop is known on the host. This allows the compiler and runtime to do an ND-range partitioning that matches the number of loop iterations.
To get this better partitioning, we use
firstprivate(upper)
map(to:upper)
on the
target
construct, as shown in the modified example below. This way, the compiler knows that the value of the variable
upper
on the host is the same as the value of the variable
upper
on the device.
``````  #pragma omp target teams distribute parallel for private(b, i, j, k, l) \
firstprivate(upper)
for (int n = 0; n < upper; n++) {
double ur = 0.;
double us = 0.;
double ut = 0.;

k = n - (n / P) * P;
j = (n - k) / P;
i = (n - (j * P + k)) / (P * P);
b = n / (P * P * P);

for (l = 0; l < P; l++) {
ur += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur * us * ut;
}
``````
For the modified example (where
upper
is firstprivate), LIBOMPTARGET_DEBUG=1 shows the following ND-range partitioning.
``````Libomptarget --> Launching target execution __omp_offloading_3d_1fed0edf__Z4main_l48 with pointer 0x00000000029b3d08 (index=1).
Target LEVEL0 RTL --> Executing a kernel 0x00000000029b3d08...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 32767, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {512, 1, 1}
``````
Note that in the above partitioning, the total number of work-items = 512 x 64 = 32,767, which exactly matches the actual number of loop iterations.
Using
firstprivate(upper)
map(to:upper)
reduced the runtime on the particular GPU used (1-tile only):
``````map(to:upper) version       : 0.000415 seconds
firstprivate(upper) version : 0.000307 seconds``````

## Allocate Memory Directly on the Device

As mentioned above, the
map
clause determines how an original host variable is mapped to a corresponding variable on the device. However, the
map(to: )
clause may not be the most efficient way to allocate memory for a variable on the device.
In the following example, the variables
ur
,
us
, and
ut
are used as work (temporary) arrays in the computations on the device. The arrays are mapped to the device using
map(to: )
clauses (lines 51-53).
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];                      /* output */
double u[SIZE], dx[P * P];           /* input */
double ur[SIZE], us[SIZE], ut[SIZE]; /* work arrays */
int b, i, j, k, l;                   /* loop counters */
double start, end;                   /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

#pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
map(to:u[0:SIZE],dx[0:P*P]) \
map(from:w[0:SIZE])         \
map(to:ur[0:SIZE])          \
map(to:us[0:SIZE])          \
map(to:ut[0:SIZE])          \
private(b,i,j,k,l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
w[IDX4(b, i, j, k)] = 0.;
ur[IDX4(b, i, j, k)] = 0.;
us[IDX4(b, i, j, k)] = 0.;
ut[IDX4(b, i, j, k)] = 0.;

for (l = 0; l < P; l++) {
ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] *
ut[IDX4(b, i, j, k)];
}
}
}
}

end = omp_get_wtime();

/* print result */
printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
Compilation command:
``icpx -fiopenmp -fopenmp-targets=spir64 test_map_to.cpp``
Run command:
``OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out``
The amount of data transferred between host and device can be seen in LIBOMPTARGET_DEBUG=1 output by grepping for
"Libomptarget --> Moving"
. The output shows that the
map(to: )
clauses for the arrays
ur
,
us
, and
ut
cause the transfer of 262,144 bytes from host to device for each of the arrays:
``````\$ grep "Libomptarget --> Moving" test_map_to.debug
Libomptarget --> Moving 262144 bytes (hst:0x00007fffca630880) -> (tgt:0xff00fffffff30000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fffca670880) -> (tgt:0xff00fffffff70000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fffca6b0880) -> (tgt:0xff00fffffffb0000)
Libomptarget --> Moving 2048 bytes (hst:0x00007fffca770880) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007fffca730880) -> (tgt:0xff00ffffffde0000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00ffffffef0000) -> (hst:0x00007fffca6f0880)``````
These data transfers are wasteful because the arrays
ur
,
us
, and
ut
are simply used as temporary work arrays on the device. A better approach would be to place the declarations of the arrays between the
declare target
and
end declare target
directives. This indicates that the arrays are mapped to the device data environment, but no data transfers for these arrays occur unless the
target update
directive is used to manage the consistency of the arrays between host and device. This approach is illustrated in the following modified example.
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

#pragma omp declare target
double ur[SIZE], us[SIZE], ut[SIZE]; /* work arrays */
#pragma omp end declare target

int main(void) {
double w[SIZE];            /* output */
double u[SIZE], dx[P * P]; /* input */
int b, i, j, k, l;         /* loop counters */
double start, end;         /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

#pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
map(to:u[0:SIZE],dx[0:P*P]) \
map(from:w[0:SIZE])         \
private(b,i,j,k,l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
w[IDX4(b, i, j, k)] = 0.;
ur[IDX4(b, i, j, k)] = 0.;
us[IDX4(b, i, j, k)] = 0.;
ut[IDX4(b, i, j, k)] = 0.;

for (l = 0; l < P; l++) {
ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] *
ut[IDX4(b, i, j, k)];
}
}
}
}

end = omp_get_wtime();

/* print result */
printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
In the above modified example, memory is allocated for arrays
ur
,
us
, and
ut
on the device, but no data transfers for these arrays take place. This is seen by grepping for
"Libomptarget --> Moving"
in LIBOMPTARGET_DEBUG=1 output. We no longer see the transfer of 262,144 bytes from host to device for each of the arrays:
``````\$ grep "Libomptarget --> Moving" test_declare_target.debug
Libomptarget --> Moving 2048 bytes (hst:0x00007ffc546bfec0) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007ffc5467fec0) -> (tgt:0xff00fffffff30000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00ffffffef0000) -> (hst:0x00007ffc5463fec0)``````
An alternative approach for allocating memory on the device, without transferring any data between host and device, uses the
map(alloc: )
map(to: )
clause, as shown below (lines 51-53).
``````//==============================================================
//
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <math.h>
#include <omp.h>

#define P 16
#define BLOCKS 8
#define SIZE (BLOCKS * P * P * P)

#define MAX 100
#define scaled_rand() ((rand() % MAX) / (1.0 * MAX))

#define IDX2(i, j) (i * P + j)
#define IDX4(b, i, j, k) (b * P * P * P + i * P * P + j * P + k)

int main(void) {
double w[SIZE];                      /* output */
double u[SIZE], dx[P * P];           /* input */
double ur[SIZE], us[SIZE], ut[SIZE]; /* work arrays */
int b, i, j, k, l;                   /* loop counters */
double start, end;                   /* timers */

omp_set_default_device(0);

/* dummy target region, so as not to measure startup time. */
#pragma omp target
{ ; }

/* initialize input with random values */
srand(0);
for (int i = 0; i < SIZE; i++)
u[i] = scaled_rand();

for (int i = 0; i < P * P; i++)
dx[i] = scaled_rand();

start = omp_get_wtime();

#pragma omp target teams distribute parallel for simd simdlen(16) collapse(4) \
map(to:u[0:SIZE],dx[0:P*P]) \
map(from:w[0:SIZE])         \
map(alloc:ur[0:SIZE])       \
map(alloc:us[0:SIZE])       \
map(alloc:ut[0:SIZE])       \
private(b,i,j,k,l)
for (b = 0; b < BLOCKS; b++) {
for (i = 0; i < P; i++) {
for (j = 0; j < P; j++) {
for (k = 0; k < P; k++) {
w[IDX4(b, i, j, k)] = 0.;
ur[IDX4(b, i, j, k)] = 0.;
us[IDX4(b, i, j, k)] = 0.;
ut[IDX4(b, i, j, k)] = 0.;

for (l = 0; l < P; l++) {
ur[IDX4(b, i, j, k)] += dx[IDX2(i, l)] * u[IDX4(b, l, j, k)];
us[IDX4(b, i, j, k)] += dx[IDX2(k, l)] * u[IDX4(b, i, l, k)];
ut[IDX4(b, i, j, k)] += dx[IDX2(j, l)] * u[IDX4(b, i, j, l)];
}

w[IDX4(b, i, j, k)] = ur[IDX4(b, i, j, k)] * us[IDX4(b, i, j, k)] *
ut[IDX4(b, i, j, k)];
}
}
}
}

end = omp_get_wtime();

/* print result */
printf("collapse-clause: w[0]=%lf time=%lf\n", w[0], end - start);

return 0;
}
``````
In the above example, the
map(alloc: )
clauses for arrays
ur
,
us
, and
ut
cause memory to be allocated for
ur
,
us
, and
ut
on the device, and no data transfers occur – as in the
declare target
and
end declare target
case:
``````\$ grep "Libomptarget --> Moving" test_map_alloc.debug
Libomptarget --> Moving 2048 bytes (hst:0x00007ffd46f256c0) -> (tgt:0xff00ffffffee0000)
Libomptarget --> Moving 262144 bytes (hst:0x00007ffd46ee56c0) -> (tgt:0xff00ffffffde0000)
Libomptarget --> Moving 262144 bytes (tgt:0xff00ffffffef0000) -> (hst:0x00007ffd46ea56c0)``````
The performance of the various versions when running on the particular GPU used (1-tile only) was as follows:
``````map(to: ) version                           : 0.001430 seconds
declare target / end declare target version : 0.000874 seconds
map(alloc: ) version                        : 0.000991 seconds``````

#### Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.