Developer Guide

Contents

Memory Allocation

This section describes the various ways of allocating memory and the types of allocations that are supported. Note that a pointer on the host has the same size as a pointer on the device.
Host allocations
are owned by the host and are intended to be allocated out of system memory. Host allocations are accessible by the host and all supported devices. Therefore, the same pointer to a host allocation may be used on the host and all supported devices. Host allocations are not expected to migrate between system memory and device-local memory. When a pointer to a host allocation is accessed on a device, data is typically sent over a bus, such as PCI-Express, that connects the device to the host.
Device allocations
are owned by a specific device and are intended to be allocated out of device-local memory. Storage allocated can be read from and written to on that device, but is not directly accessible from the host or any other supported devices.
Shared allocations
are accessible by the host and all supported devices, so the same pointer to a shared allocation may be used on the host and all supported devices, as with host allocations. Shared allocations, however, are not owned by any particular device, but are intended to migrate between the host and one or more devices. This means that accesses on a device, after the migration has occurred, happen from much faster device-local memory instead of remotely accessing system memory through the higher-latency bus connection.
Shared-system allocations
are a sub-class of shared allocations, where the memory is allocated by a system allocator (such as malloc or new) rather than by an allocation API (such as the OpenMP memory allocation API). Shared-system allocations have no associated device; they are inherently cross-device. Like other shared allocations, shared-system allocations are intended to migrate between the host and supported devices, and the same pointer to a shared-system allocation may be used on the host and all supported devices.
Note:
  • Currently, shared-system allocations are not supported on ATS and PVC systems. However, shared allocations where memory is allocated by an allocation API are supported on ATS and PVC.
The following table summarizes the characteristics of the various types of memory allocation.
Type of allocation    Initial location                Accessible on host?    Accessible on device?
Host                  Host                            Yes                    Yes
Device                Device                          No                     Yes
Shared                Host, Device, or Unspecified    Yes                    Yes
Shared-System         Host                            Yes                    Yes
Host allocations offer wide accessibility (can be accessed directly from the host and all supported devices), but have potentially high per-access costs because data is typically sent over a bus such as PCI Express*.
Shared allocations also offer wide accessibility, but the per-access costs are potentially lower than host allocations, because data is migrated to the accessing device.
Device allocations have access limitations (cannot be accessed directly from the host or other supported devices), but offer higher performance because accesses are to device-local memory.

OpenMP Runtime Routines for Memory Allocation

Intel compilers support a number of OpenMP runtime routines for performing memory allocations. These routines are shown in the table below.
OpenMP memory allocation routine    Intel extension?    Type of allocation
omp_target_alloc                    No                  Device
omp_target_alloc_device             Yes                 Device
omp_target_alloc_host               Yes                 Host
omp_target_alloc_shared             Yes                 Shared
Note that the three routines omp_target_alloc_device, omp_target_alloc_host, and omp_target_alloc_shared are Intel extensions to the OpenMP specification.
The following examples use the above OpenMP memory allocation routines. Compare them to the versions that use map clauses.

Using the map Clause

The first example uses map clauses to allocate memory on a device and copy data between the host and the device.
In the following example, arrays A, B, and C are allocated in system memory by calling the C/C++ standard library routine malloc.
The target construct inside the iterations loop is the main kernel; it computes the values of array C on the device. The map(tofrom: C[0:length]) clause is specified on this target construct since the values of C need to be transferred from the host to the device before the computation, and from the device to the host at the end of the computation. The map(to: A[0:length], B[0:length]) clause is specified for arrays A and B since the values of these arrays need to be transferred from the host to the device, and the device only reads these values. Under the covers, the map clauses cause storage for the arrays to be allocated on the device and data to be copied from the host to the device, and vice versa.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>

#define iterations 100
#define length 64*1024*1024

int main(void)
{
    size_t bytes = length*sizeof(double);
    double * __restrict A;
    double * __restrict B;
    double * __restrict C;
    double scalar = 3.0;
    double nstream_time = 0.0;

    // Allocate arrays on the host using plain malloc()
    A = (double *) malloc(bytes);
    if (A == NULL){
        printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
        exit(1);
    }
    B = (double *) malloc(bytes);
    if (B == NULL){
        printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
        exit(1);
    }
    C = (double *) malloc(bytes);
    if (C == NULL){
        printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
        exit(1);
    }

    // Initialize the arrays
    #pragma omp parallel for
    for (size_t i=0; i<length; i++) {
        A[i] = 2.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    // Perform the computation
    nstream_time = omp_get_wtime();
    for (int iter = 0; iter<iterations; iter++) {
        #pragma omp target teams distribute parallel for \
                map(to: A[0:length], B[0:length]) \
                map(tofrom: C[0:length])
        for (size_t i=0; i<length; i++) {
            C[i] += A[i] + scalar * B[i];
        }
    }
    nstream_time = omp_get_wtime() - nstream_time;

    // Validate and output results
    double ar = 2.0;
    double br = 2.0;
    double cr = 0.0;
    for (int iter = 0; iter<iterations; iter++) {
        for (int i=0; i<length; i++) {
            cr += ar + scalar * br;
        }
    }

    double asum = 0.0;
    #pragma omp parallel for reduction(+:asum)
    for (size_t i=0; i<length; i++) {
        asum += fabs(C[i]);
    }

    free(A);
    free(B);
    free(C);

    double epsilon=1.e-8;
    if (fabs(cr - asum)/asum > epsilon) {
        printf("Failed Validation on output array\n"
               "       Expected checksum: %lf\n"
               "       Observed checksum: %lf\n"
               "ERROR: solution did not validate\n", cr, asum);
        return 1;
    } else {
        printf("Solution validates\n");
        double avgtime = nstream_time/iterations;
        printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
    }
    return 0;
}
Compilation command:
icx -fiopenmp -fopenmp-targets=spir64 test_target_map.cpp
Run command:
OMP_TARGET_OFFLOAD=MANDATORY ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_DEBUG=1 ./a.out
The map clauses on the target construct inside the iterations loop cause data (the values of A, B, and C) to be transferred from the host to the device at the beginning of each target region, and cause data (the values of C) to be transferred from the device to the host at the end of each target region. These data transfers incur a significant performance overhead. A better approach using map clauses is to put the whole iterations loop inside a target data construct with the map clauses. This causes the transfers to occur once at the beginning of the iterations loop, and once at the end of the iterations loop. The modified example using target data and map clauses is shown below.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>

#define iterations 100
#define length 64*1024*1024

int main(void)
{
    size_t bytes = length*sizeof(double);
    double * __restrict A;
    double * __restrict B;
    double * __restrict C;
    double scalar = 3.0;
    double nstream_time = 0.0;

    // Allocate arrays on the host using plain malloc()
    A = (double *) malloc(bytes);
    if (A == NULL){
        printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
        exit(1);
    }
    B = (double *) malloc(bytes);
    if (B == NULL){
        printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
        exit(1);
    }
    C = (double *) malloc(bytes);
    if (C == NULL){
        printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
        exit(1);
    }

    // Initialize the arrays
    #pragma omp parallel for
    for (size_t i=0; i<length; i++) {
        A[i] = 2.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    // Perform the computation
    nstream_time = omp_get_wtime();
    #pragma omp target data map(to: A[0:length], B[0:length]) \
            map(tofrom: C[0:length])
    {
        for (int iter = 0; iter<iterations; iter++) {
            #pragma omp target teams distribute parallel for
            for (size_t i=0; i<length; i++) {
                C[i] += A[i] + scalar * B[i];
            }
        }
    }
    nstream_time = omp_get_wtime() - nstream_time;

    // Validate and output results
    double ar = 2.0;
    double br = 2.0;
    double cr = 0.0;
    for (int iter = 0; iter<iterations; iter++) {
        for (int i=0; i<length; i++) {
            cr += ar + scalar * br;
        }
    }

    double asum = 0.0;
    #pragma omp parallel for reduction(+:asum)
    for (size_t i=0; i<length; i++) {
        asum += fabs(C[i]);
    }

    free(A);
    free(B);
    free(C);

    double epsilon=1.e-8;
    if (fabs(cr - asum)/asum > epsilon) {
        printf("Failed Validation on output array\n"
               "       Expected checksum: %lf\n"
               "       Observed checksum: %lf\n"
               "ERROR: solution did not validate\n", cr, asum);
        return 1;
    } else {
        printf("Solution validates\n");
        double avgtime = nstream_time/iterations;
        printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
    }
    return 0;
}

omp_target_alloc

Next, the example above is modified to use device allocations instead of map clauses. Storage for arrays A, B, and C is allocated directly on the device by calling the OpenMP runtime routine omp_target_alloc. The routine takes two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc returns NULL, the allocation was not successful.
To access the allocated memory in a target construct, the device pointer returned by omp_target_alloc is listed in an is_device_ptr clause on the target construct. This ensures that there is no data transfer before and after kernel execution, since the kernel operates on data that is already on the device.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C on the device.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>

#define iterations 100
#define length 64*1024*1024

int main(void)
{
    int device_id = omp_get_default_device();
    size_t bytes = length*sizeof(double);
    double * __restrict A;
    double * __restrict B;
    double * __restrict C;
    double scalar = 3.0;
    double nstream_time = 0.0;

    // Allocate arrays in device memory
    A = (double *) omp_target_alloc(bytes, device_id);
    if (A == NULL){
        printf(" ERROR: Cannot allocate space for A using omp_target_alloc().\n");
        exit(1);
    }
    B = (double *) omp_target_alloc(bytes, device_id);
    if (B == NULL){
        printf(" ERROR: Cannot allocate space for B using omp_target_alloc().\n");
        exit(1);
    }
    C = (double *) omp_target_alloc(bytes, device_id);
    if (C == NULL){
        printf(" ERROR: Cannot allocate space for C using omp_target_alloc().\n");
        exit(1);
    }

    // Initialize the arrays
    #pragma omp target teams distribute parallel for \
            is_device_ptr(A,B,C)
    for (size_t i=0; i<length; i++) {
        A[i] = 2.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    // Perform the computation 'iterations' number of times
    nstream_time = omp_get_wtime();
    for (int iter = 0; iter<iterations; iter++) {
        #pragma omp target teams distribute parallel for \
                is_device_ptr(A,B,C)
        for (size_t i=0; i<length; i++) {
            C[i] += A[i] + scalar * B[i];
        }
    }
    nstream_time = omp_get_wtime() - nstream_time;

    // Validate and output results
    double ar = 2.0;
    double br = 2.0;
    double cr = 0.0;
    for (int iter = 0; iter<iterations; iter++) {
        for (int i=0; i<length; i++) {
            cr += ar + scalar * br;
        }
    }

    double asum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:asum) \
            map(tofrom: asum) is_device_ptr(C)
    for (size_t i=0; i<length; i++) {
        asum += fabs(C[i]);
    }

    omp_target_free(A, device_id);
    omp_target_free(B, device_id);
    omp_target_free(C, device_id);

    double epsilon=1.e-8;
    if (fabs(cr - asum)/asum > epsilon) {
        printf("Failed Validation on output array\n"
               "       Expected checksum: %lf\n"
               "       Observed checksum: %lf\n"
               "ERROR: solution did not validate\n", cr, asum);
        return 1;
    } else {
        printf("Solution validates\n");
        double avgtime = nstream_time/iterations;
        printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
    }
    return 0;
}
Notes:
  • When calling omp_target_alloc, the device number specified must be one of the supported devices, other than the host device. This is the device on which storage will be allocated.
  • Since the arrays A, B, and C are not accessible from the host, the initialization of the arrays, kernel execution, and summation of the elements of C all need to be done inside OpenMP target regions.
  • A device allocation can only be accessed by the device specified in the omp_target_alloc call, but it may be copied to memory allocated on the host or on other devices by calling omp_target_memcpy.

omp_target_alloc_device

The Intel extension omp_target_alloc_device is similar to omp_target_alloc. It is also called with two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc_device returns NULL, the allocation was not successful.
The above omp_target_alloc example can be rewritten using omp_target_alloc_device by simply replacing each call to omp_target_alloc with a call to omp_target_alloc_device, as shown below.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C on the device.
// Allocate arrays in device memory
A = (double *) omp_target_alloc_device(bytes, device_id);
if (A == NULL){
    printf(" ERROR: Cannot allocate space for A using omp_target_alloc_device().\n");
    exit(1);
}
B = (double *) omp_target_alloc_device(bytes, device_id);
if (B == NULL){
    printf(" ERROR: Cannot allocate space for B using omp_target_alloc_device().\n");
    exit(1);
}
C = (double *) omp_target_alloc_device(bytes, device_id);
if (C == NULL){
    printf(" ERROR: Cannot allocate space for C using omp_target_alloc_device().\n");
    exit(1);
}
Note:
  • All of the above Notes that apply to omp_target_alloc also apply to omp_target_alloc_device.

omp_target_alloc_host

The above example can also be rewritten using host allocations for A, B, and C. This makes the memory accessible to the host and all supported devices.
In the following modified example, the omp_target_alloc_host runtime routine (an Intel extension) is called to allocate storage for each of the arrays A, B, and C. The routine takes two arguments: the number of bytes to allocate, and a device number. The device number must be one of the supported devices, other than the host device. The routine returns a pointer to a storage location in host memory. If the call to omp_target_alloc_host returns NULL, the allocation was not successful.
Note that the directive requires unified_address is specified at the top of the program. This requires the implementation to guarantee that all devices accessible through OpenMP API routines and directives use a unified address space. In this address space, a pointer always refers to the same location in memory from all devices, and the is_device_ptr clause is not necessary to obtain device addresses from device pointers for use inside target regions. When using Intel compilers, the requires unified_address directive is actually not needed, since a unified address space is guaranteed by default. However, for portability the code includes the directive.
The pointer returned by a call to omp_target_alloc_host can be used to access the storage from the host and all supported devices. No map clauses and no is_device_ptr clauses are needed on a target construct to access the memory from a device, since a unified address space is used.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>

#pragma omp requires unified_address

#define iterations 100
#define length 64*1024*1024

int main(void)
{
    int device_id = omp_get_default_device();
    size_t bytes = length*sizeof(double);
    double * __restrict A;
    double * __restrict B;
    double * __restrict C;
    double scalar = 3.0;
    double nstream_time = 0.0;

    // Allocate arrays in host memory
    A = (double *) omp_target_alloc_host(bytes, device_id);
    if (A == NULL){
        printf(" ERROR: Cannot allocate space for A using omp_target_alloc_host().\n");
        exit(1);
    }
    B = (double *) omp_target_alloc_host(bytes, device_id);
    if (B == NULL){
        printf(" ERROR: Cannot allocate space for B using omp_target_alloc_host().\n");
        exit(1);
    }
    C = (double *) omp_target_alloc_host(bytes, device_id);
    if (C == NULL){
        printf(" ERROR: Cannot allocate space for C using omp_target_alloc_host().\n");
        exit(1);
    }

    // Initialize the arrays
    #pragma omp parallel for
    for (size_t i=0; i<length; i++) {
        A[i] = 2.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    // Perform the computation
    nstream_time = omp_get_wtime();
    for (int iter = 0; iter<iterations; iter++) {
        #pragma omp target teams distribute parallel for
        for (size_t i=0; i<length; i++) {
            C[i] += A[i] + scalar * B[i];
        }
    }
    nstream_time = omp_get_wtime() - nstream_time;

    // Validate and output results
    double ar = 2.0;
    double br = 2.0;
    double cr = 0.0;
    for (int iter = 0; iter<iterations; iter++) {
        for (int i=0; i<length; i++) {
            cr += ar + scalar * br;
        }
    }

    double asum = 0.0;
    #pragma omp parallel for reduction(+:asum)
    for (size_t i=0; i<length; i++) {
        asum += fabs(C[i]);
    }

    omp_target_free(A, device_id);
    omp_target_free(B, device_id);
    omp_target_free(C, device_id);

    double epsilon=1.e-8;
    if (fabs(cr - asum)/asum > epsilon) {
        printf("Failed Validation on output array\n"
               "       Expected checksum: %lf\n"
               "       Observed checksum: %lf\n"
               "ERROR: solution did not validate\n", cr, asum);
        return 1;
    } else {
        printf("Solution validates\n");
        double avgtime = nstream_time/iterations;
        printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
    }
    return 0;
}
Notes:
  • When calling omp_target_alloc_host, the device number specified must be one of the supported devices, other than the host device.
  • Since the arrays A, B, and C are accessible from the host and device, the initialization of the arrays and summation of the elements of C may be done either on the host (outside of a target construct) or on the device (inside a target construct).
  • ATS and PVC do not support atomic operations (or algorithms that use atomic operations, such as some reductions) on host allocations (i.e., memory allocated via omp_target_alloc_host). Use atomic operations on memory allocated via omp_target_alloc_device instead.

omp_target_alloc_shared

The above example is modified so that shared allocations are used instead of host allocations. The omp_target_alloc_shared runtime routine (an Intel extension) is called to allocate storage for each of the arrays A, B, and C. The routine takes two arguments: the number of bytes to allocate, and a device number. The device number must be one of the supported devices, other than the host device. The routine returns a pointer to a storage location in shared memory. If the call to omp_target_alloc_shared returns NULL, the allocation was not successful.
Note that the requires unified_address directive is specified at the top of the program, for portability.
The pointer returned by a call to omp_target_alloc_shared can be used to access the storage from the host and all supported devices. No map clauses and no is_device_ptr clauses are needed on a target construct to access the memory from a device, since a unified address space is used.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C.
// Allocate arrays in shared memory
A = (double *) omp_target_alloc_shared(bytes, device_id);
if (A == NULL){
    printf(" ERROR: Cannot allocate space for A using omp_target_alloc_shared().\n");
    exit(1);
}
B = (double *) omp_target_alloc_shared(bytes, device_id);
if (B == NULL){
    printf(" ERROR: Cannot allocate space for B using omp_target_alloc_shared().\n");
    exit(1);
}
C = (double *) omp_target_alloc_shared(bytes, device_id);
if (C == NULL){
    printf(" ERROR: Cannot allocate space for C using omp_target_alloc_shared().\n");
    exit(1);
}
Notes:
  • When calling omp_target_alloc_shared, the device number specified must be one of the supported devices, other than the host device.
  • Since the arrays are accessible from the host and device, the initialization and verification may be done either on the host or on the device (inside a target construct).
  • Concurrent access from the host and the device to memory allocated via omp_target_alloc_shared is not supported.

omp_target_memcpy

The following example shows how the runtime routine omp_target_memcpy may be used to copy memory from host to device, and from device to host. First, arrays h_A, h_B, and h_C are allocated in system memory using plain malloc, and then initialized. Corresponding arrays d_A, d_B, and d_C are allocated on the device using omp_target_alloc.
Before the start of the target construct inside the iterations loop, the values in h_A, h_B, and h_C are copied to d_A, d_B, and d_C by calling omp_target_memcpy. After the target region, the new d_C values computed on the device are copied to h_C by calling omp_target_memcpy.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>

#define iterations 100
#define length 64*1024*1024

int main(void)
{
    int device_id = omp_get_default_device();
    int host_id = omp_get_initial_device();
    size_t bytes = length*sizeof(double);
    double * __restrict h_A;
    double * __restrict h_B;
    double * __restrict h_C;
    double * __restrict d_A;
    double * __restrict d_B;
    double * __restrict d_C;
    double scalar = 3.0;
    double nstream_time = 0.0;

    // Allocate arrays h_A, h_B, and h_C on the host using plain malloc()
    h_A = (double *) malloc(bytes);
    if (h_A == NULL){
        printf(" ERROR: Cannot allocate space for h_A using plain malloc().\n");
        exit(1);
    }
    h_B = (double *) malloc(bytes);
    if (h_B == NULL){
        printf(" ERROR: Cannot allocate space for h_B using plain malloc().\n");
        exit(1);
    }
    h_C = (double *) malloc(bytes);
    if (h_C == NULL){
        printf(" ERROR: Cannot allocate space for h_C using plain malloc().\n");
        exit(1);
    }

    // Allocate arrays d_A, d_B, and d_C on the device using omp_target_alloc()
    d_A = (double *) omp_target_alloc(bytes, device_id);
    if (d_A == NULL){
        printf(" ERROR: Cannot allocate space for d_A using omp_target_alloc().\n");
        exit(1);
    }
    d_B = (double *) omp_target_alloc(bytes, device_id);
    if (d_B == NULL){
        printf(" ERROR: Cannot allocate space for d_B using omp_target_alloc().\n");
        exit(1);
    }
    d_C = (double *) omp_target_alloc(bytes, device_id);
    if (d_C == NULL){
        printf(" ERROR: Cannot allocate space for d_C using omp_target_alloc().\n");
        exit(1);
    }

    // Initialize the arrays on the host
    #pragma omp parallel for
    for (size_t i=0; i<length; i++) {
        h_A[i] = 2.0;
        h_B[i] = 2.0;
        h_C[i] = 0.0;
    }

    // Call omp_target_memcpy() to copy values from host to device
    int rc = 0;
    rc = omp_target_memcpy(d_A, h_A, bytes, 0, 0, device_id, host_id);
    if (rc) {
        printf("ERROR: omp_target_memcpy(A) returned %d\n", rc);
        exit(1);
    }
    rc = omp_target_memcpy(d_B, h_B, bytes, 0, 0, device_id, host_id);
    if (rc) {
        printf("ERROR: omp_target_memcpy(B) returned %d\n", rc);
        exit(1);
    }
    rc = omp_target_memcpy(d_C, h_C, bytes, 0, 0, device_id, host_id);
    if (rc) {
        printf("ERROR: omp_target_memcpy(C) returned %d\n", rc);
        exit(1);
    }

    // Perform the computation
    nstream_time = omp_get_wtime();
    for (int iter = 0; iter<iterations; iter++) {
        #pragma omp target teams distribute parallel for \
                is_device_ptr(d_A,d_B,d_C)
        for (size_t i=0; i<length; i++) {
            d_C[i] += d_A[i] + scalar * d_B[i];
        }
    }
    nstream_time = omp_get_wtime() - nstream_time;

    // Call omp_target_memcpy() to copy values from device to host
    rc = omp_target_memcpy(h_C, d_C, bytes, 0, 0, host_id, device_id);
    if (rc) {
        printf("ERROR: omp_target_memcpy(C) returned %d\n", rc);
        exit(1);
    }

    // Validate and output results
    double ar = 2.0;
    double br = 2.0;
    double cr = 0.0;
    for (int iter = 0; iter<iterations; iter++) {
        for (int i=0; i<length; i++) {
            cr += ar + scalar * br;
        }
    }

    double asum = 0.0;
    #pragma omp parallel for reduction(+:asum)
    for (size_t i=0; i<length; i++) {
        asum += fabs(h_C[i]);
    }

    free(h_A);
    free(h_B);
    free(h_C);
    omp_target_free(d_A, device_id);
    omp_target_free(d_B, device_id);
    omp_target_free(d_C, device_id);

    double epsilon=1.e-8;
    if (fabs(cr - asum)/asum > epsilon) {
        printf("Failed Validation on output array\n"
               "       Expected checksum: %lf\n"
               "       Observed checksum: %lf\n"
               "ERROR: solution did not validate\n", cr, asum);
        return 1;
    } else {
        printf("Solution validates\n");
        double avgtime = nstream_time/iterations;
        printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
    }
    return 0;
}

Performance Considerations

In the above examples (using the map clause, omp_target_alloc, omp_target_alloc_device, omp_target_alloc_host, omp_target_alloc_shared, and omp_target_memcpy), the main kernel is the target construct that computes the values of array C. To get more accurate timings, this target construct is enclosed in a loop, so the offload happens iterations number of times (where iterations = 100). The average kernel time is computed by dividing the total time taken by the iterations loop by 100.
// Perform the computation 'iterations' number of times
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
    #pragma omp target teams distribute parallel for \
            is_device_ptr(A,B,C)
    for (size_t i=0; i<length; i++) {
        C[i] += A[i] + scalar * B[i];
    }
}
nstream_time = omp_get_wtime() - nstream_time;
LIBOMPTARGET_DEBUG=1 output shows that all the above examples have the same ND_range partitioning.
Target LEVEL0 RTL --> Allocated a device memory 0xff00000020200000
Libomptarget --> omp_target_alloc returns device ptr 0xff00000020200000
Libomptarget --> Call to omp_target_alloc for device 0 requesting 536870912 bytes
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
The following table shows the average times taken by the kernel in the various versions when running on the particular GPU used (1-tile only).
Version                    Time (seconds)
map                        0.183604
map + target data          0.012757
omp_target_alloc           0.002501
omp_target_alloc_device    0.002499
omp_target_alloc_host      0.074412
omp_target_alloc_shared    0.012491
omp_target_memcpy          0.011072
The above performance numbers show that the map version is the slowest (0.183604 seconds). This is because of the data transfers that occur at the beginning and end of each kernel launch. The main kernel is launched 100 times. At the beginning of each kernel launch, storage for arrays A, B, and C is allocated on the device, and the values of these arrays are copied from the host to the device. At the end of the kernel, the values of array C are copied from the device to the host. Putting the whole iterations loop inside a target data construct with map clauses reduced the runtime to 0.012757 seconds, because the transfers occur once at the launch of the first kernel in the iterations loop, and again after the last kernel in that loop.
The omp_target_alloc and omp_target_alloc_device versions have the best performance (0.002501 and 0.002499 seconds, respectively). In these versions, storage for A, B, and C is allocated directly in device memory, so accesses on the device happen from device-local memory. This is a useful model for applications that use scratch arrays on the device side; these arrays never need to be accessed on the host. In such cases, the recommendation is to allocate the scratch arrays on the device and not worry about data transfers, as illustrated in this example.
The omp_target_alloc_shared version also performs well, but is somewhat slower (0.012491 seconds). In this version, storage for A, B, and C is allocated in shared memory, so the data can migrate between the host and the device. There is the overhead of migration but, after migration, accesses on the device happen from much faster device-local memory. In this version, the initialization of the arrays happens on the host. At the first kernel launch, the arrays are migrated to the device, and the kernels access the arrays locally on the device. Finally, before the host performs the reduction computation, the entire C array is migrated back to the host.
The omp_target_alloc_host version (0.074412 seconds) takes almost 6x more time than the omp_target_alloc_shared version. This is because data allocated in host memory does not migrate from the host to the device. When the kernel accesses the data, the data is typically sent over a bus, such as PCI Express, that connects the device to the host. This is slower than accessing local device memory. If the device accesses only a small part of an array infrequently, then that array may be allocated in host memory using omp_target_alloc_host. However, if an array is accessed frequently on the device side, it should be kept in device memory. Keeping the data in host memory and accessing it over the PCI Express bus degrades performance.
Finally, a note regarding data transfers: the amount of data transferred in the map version can be seen in LIBOMPTARGET_DEBUG=1 output by grepping for "Libomptarget --> Moving". Notice that each launch of the main kernel yields the following data transfers:
$ grep "Libomptarget --> Moving" test_target_map.debug
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a5fc8b010) -> (tgt:0xff00000000200000)
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a9fc8d010) -> (tgt:0xff00000020200000)
Libomptarget --> Moving 536870912 bytes (hst:0x00007f1a7fc8c010) -> (tgt:0xff00000040200000)
Libomptarget --> Moving 536870912 bytes (tgt:0xff00000000200000) -> (hst:0x00007f1a5fc8b010)
On the other hand, data transfers in the omp_target_alloc_... versions are handled by a lower layer of the runtime system, so grepping for "Libomptarget --> Moving" in LIBOMPTARGET_DEBUG=1 output for these versions will not show the data transfers that took place.
Fortran
The Fortran version of the example using target data and map clauses is shown below.
!=============================================================
! Copyright © 2022 Intel Corporation
!
! SPDX-License-Identifier: MIT
!=============================================================
program main
  use iso_fortran_env
  use omp_lib
  implicit none

  integer, parameter :: iterations=100
  integer, parameter :: length=64*1024*1024
  real(kind=REAL64), parameter :: epsilon=1.D-8

  real(kind=REAL64), allocatable :: A(:)
  real(kind=REAL64), allocatable :: B(:)
  real(kind=REAL64), allocatable :: C(:)
  real(kind=REAL64) :: scalar=3.0
  real(kind=REAL64) :: ar, br, cr, asum
  real(kind=REAL64) :: nstream_time, avgtime
  integer :: err, i, iter

  !
  ! Allocate arrays on the host using plain allocate

  allocate( A(length), stat=err )
  if (err .ne. 0) then
    print *, "Allocation of A returned ", err
    stop 1
  endif

  allocate( B(length), stat=err )
  if (err .ne. 0) then
    print *, "Allocation of B returned ", err
    stop 1
  endif

  allocate( C(length), stat=err )
  if (err .ne. 0) then
    print *, "Allocation of C returned ", err
    stop 1
  endif

  !
  ! Initialize the arrays

  !$omp parallel do
  do i = 1, length
    A(i) = 2.0
    B(i) = 2.0
    C(i) = 0.0
  end do

  !
  ! Perform the computation

  nstream_time = omp_get_wtime()
  !$omp target data map(to: A, B) map(tofrom: C)
  do iter = 1, iterations
    !$omp target teams distribute parallel do
    do i = 1, length
      C(i) = C(i) + A(i) + scalar * B(i)
    end do
  end do
  !$omp end target data
  nstream_time = omp_get_wtime() - nstream_time

  !
  ! Validate and output results

  ar = 2.0
  br = 2.0
  cr = 0.0
  do iter = 1, iterations
    do i = 1, length
      cr = cr + ar + scalar * br
    end do
  end do

  asum = 0.0
  !$omp parallel do reduction(+:asum)
  do i = 1, length
    asum = asum + abs(C(i))
  end do

  if (abs(cr - asum)/asum > epsilon) then
    write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum
  else
    avgtime = nstream_time/iterations
    write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =", avgtime
  endif

110 format (A, F20.6, A, F20.6)
120 format (A, F20.6, A, F10.6)

  deallocate(A)
  deallocate(B)
  deallocate(C)

end program main
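The host-side validation in this example computes the expected checksum analytically: each kernel iteration adds A(i) + scalar*B(i) = 2 + 3*2 = 8 to every element of C, so after all iterations the checksum is fully determined by the loop bounds. A quick sketch of that arithmetic (in Python, purely for illustration):

```python
# Mirror the validation arithmetic from the Fortran listing.
iterations = 100
length = 64 * 1024 * 1024
scalar = 3.0
ar, br = 2.0, 2.0   # initial values of A(i) and B(i)

# cr accumulates ar + scalar*br once per element per iteration,
# exactly as the nested validation loops do.
cr = iterations * length * (ar + scalar * br)

# Every C(i) ends up as iterations * 8, so the checksum over all elements is:
asum = length * iterations * (ar + scalar * br)

print(cr, asum, abs(cr - asum) / asum <= 1e-8)  # 53687091200.0 53687091200.0 True
```
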
The Fortran version of the example using omp_target_alloc_device is shown below. In this example, allocate directives with the allocator omp_target_device_mem_alloc are used to allocate arrays A, B, and C on the device. The use_device_addr(A, B, C) clause is used on the target data directive to indicate that the arrays have device addresses, and these addresses should be used in the target region.
!=============================================================
! Copyright © 2022 Intel Corporation
!
! SPDX-License-Identifier: MIT
!=============================================================
program main
  use iso_fortran_env
  use omp_lib
  implicit none

  integer, parameter :: iterations=100
  integer, parameter :: length=64*1024*1024
  real(kind=REAL64), parameter :: epsilon=1.D-8

  real(kind=REAL64), allocatable :: A(:)
  real(kind=REAL64), allocatable :: B(:)
  real(kind=REAL64), allocatable :: C(:)
  real(kind=REAL64) :: scalar=3.0
  real(kind=REAL64) :: ar, br, cr, asum
  real(kind=REAL64) :: nstream_time, avgtime
  integer :: i, iter

  !
  ! Allocate arrays in device memory

  !$omp allocate allocator(omp_target_device_mem_alloc)
  allocate(A(length))

  !$omp allocate allocator(omp_target_device_mem_alloc)
  allocate(B(length))

  !$omp allocate allocator(omp_target_device_mem_alloc)
  allocate(C(length))

  !
  ! Begin target data

  !$omp target data use_device_addr(A, B, C)

  !
  ! Initialize the arrays

  !$omp target teams distribute parallel do
  do i = 1, length
    A(i) = 2.0
    B(i) = 2.0
    C(i) = 0.0
  end do

  !
  ! Perform the computation

  nstream_time = omp_get_wtime()
  do iter = 1, iterations
    !$omp target teams distribute parallel do
    do i = 1, length
      C(i) = C(i) + A(i) + scalar * B(i)
    end do
  end do
  nstream_time = omp_get_wtime() - nstream_time

  !
  ! Validate and output results

  ar = 2.0
  br = 2.0
  cr = 0.0
  do iter = 1, iterations
    do i = 1, length
      cr = cr + ar + scalar * br
    end do
  end do

  asum = 0.0
  !$omp target teams distribute parallel do reduction(+:asum) &
  !$omp map(tofrom: asum)
  do i = 1, length
    asum = asum + abs(C(i))
  end do

  !
  ! End target data

  !$omp end target data

  if (abs(cr - asum)/asum > epsilon) then
    write(*,110) "Failed Validation on output array: Expected =", cr, ", Observed =", asum
  else
    avgtime = nstream_time/iterations
    write(*,120) "Solution validates: Checksum =", asum, ", Avg time (s) =", avgtime
  endif

110 format (A, F20.6, A, F20.6)
120 format (A, F20.6, A, F10.6)

  deallocate(A)
  deallocate(B)
  deallocate(C)

end program main
