Memory Allocation
Memory allocations are represented as pointers in the application. A
pointer on the host has the same size as a pointer on the device. This
section looks at various ways of allocating memory, and the types of
allocations that are supported.
Host allocations are owned by the host and are intended to be allocated out of system memory. Host allocations are accessible by the host and all supported devices, so the same pointer to a host allocation may be used on the host and on all supported devices. Host allocations are not expected to migrate between system memory and device-local memory. When a pointer to a host allocation is accessed on a device, the data is typically sent over a bus, such as PCI Express*, that connects the device to the host.
Device allocations are owned by a specific device and are intended to be allocated out of device-local memory. The allocated storage can be read from and written to on that device, but it is not directly accessible from the host or from any other supported device.
Shared allocations are accessible by the host and all supported devices, so the same pointer to a shared allocation may be used on the host and on all supported devices, as with a host allocation. Shared allocations, however, are not owned by any particular device; they are intended to migrate between the host and one or more devices. After migration has occurred, accesses on a device are satisfied from much faster device-local memory instead of remotely accessing system memory through the higher-latency bus connection.
Shared-system allocations are a sub-class of shared allocations in which the memory is allocated by a system allocator (such as malloc or new) rather than by an allocation API (such as the OpenMP* memory allocation API). Shared-system allocations have no associated device; they are inherently cross-device. Like other shared allocations, shared-system allocations are intended to migrate between the host and supported devices, and the same pointer to a shared-system allocation may be used on the host and on all supported devices.
Note: Currently, shared-system allocations are not supported on ATS and PVC systems. However, shared allocations where memory is allocated by an allocation API are supported on ATS and PVC.
The following table summarizes the characteristics of the various
types of memory allocation.
| Type of allocation | Initial location | Accessible on host? | Accessible on device? |
|---|---|---|---|
| Host | Host | Yes | Yes |
| Device | Device | No | Yes |
| Shared | Host, Device, or Unspecified | Yes | Yes |
| Shared-System | Host | Yes | Yes |
Host allocations offer wide accessibility (can be accessed directly
from the host and all supported devices), but have potentially high
per-access costs because data is typically sent over a bus such as
PCI Express*.
Shared allocations also offer wide accessibility, but the per-access
costs are potentially lower than for host allocations, because data is
migrated to the accessing device.
Device allocations have access limitations (cannot be accessed
directly from the host or other supported devices), but offer higher
performance because accesses are to device-local memory.
OpenMP Runtime Routines for Memory Allocation
Intel compilers support a number of OpenMP runtime routines for performing
memory allocations. These routines are shown in the table below.
| OpenMP memory allocation routine | Intel extension? | Type of allocation |
|---|---|---|
| omp_target_alloc | No | Device |
| omp_target_alloc_device | Yes | Device |
| omp_target_alloc_host | Yes | Host |
| omp_target_alloc_shared | Yes | Shared |
Note that the three routines omp_target_alloc_device, omp_target_alloc_host, and omp_target_alloc_shared are Intel extensions to the OpenMP specification.
The following examples use the above OpenMP memory allocation routines. Compare them to the examples that use map clauses.
For more information about memory allocation, see:
- The DPC++ part of this guide
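Before walking through the full examples, here is a minimal sketch (for illustration only; it assumes a single supported device and uses the Intel extension routines listed in the table above) contrasting how pointers from the different allocation routines may be used:

```c
// Minimal sketch only: contrasts the allocation types. The routines
// omp_target_alloc_host/_shared are the Intel extensions listed above.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#pragma omp requires unified_address  // host/shared pointers usable in target regions

int main(void) {
  int dev = omp_get_default_device();
  size_t bytes = 1024 * sizeof(double);

  // Device allocation: device-local memory, not host-accessible.
  double *d = (double *) omp_target_alloc(bytes, dev);
  // Host allocation: system memory, accessed by the device over the bus.
  double *h = (double *) omp_target_alloc_host(bytes, dev);
  // Shared allocation: migrates between system and device-local memory.
  double *s = (double *) omp_target_alloc_shared(bytes, dev);
  if (d == NULL || h == NULL || s == NULL) {
    printf("ERROR: allocation failed\n");
    exit(1);
  }

  h[0] = 1.0;     // host allocations can be touched directly on the host
  s[0] = 2.0;     // so can shared allocations
  // d[0] = 3.0;  // invalid: device allocations are not host-accessible

  // The device pointer is listed in is_device_ptr; the host and shared
  // pointers need no map or is_device_ptr clauses.
  #pragma omp target is_device_ptr(d)
  { d[0] = h[0] + s[0]; }

  omp_target_free(d, dev);
  omp_target_free(h, dev);
  omp_target_free(s, dev);
  return 0;
}
```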
Using the map Clause
The first example uses map clauses to allocate memory on a device and copy data between the host and the device.
In the following example, arrays A, B, and C are allocated in system memory by calling the C/C++ standard library routine malloc.
The target construct inside the iterations loop is the main kernel that computes the values of array C on the device. The map(tofrom: C[0:length]) clause is specified on this target construct since the values of C need to be transferred from the host to the device before the computation, and from the device to the host at the end of the computation. The map(to: A[0:length], B[0:length]) clause is specified for arrays A and B since the values of these arrays need to be transferred from the host to the device, and the device only reads them. Under the covers, the map clauses cause storage for the arrays to be allocated on the device and data to be copied from the host to the device, and vice versa.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>
#define iterations 1000
#define length 64*1024*1024
int main(int argc, char * argv[])
{
size_t bytes = length*sizeof(double);
double * __restrict A;
double * __restrict B;
double * __restrict C;
double scalar = 3.0;
double nstream_time = 0.0;
// Allocate arrays on the host using plain malloc()
A = (double *) malloc(bytes);
if (A == NULL){
printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
exit(1);
}
B = (double *) malloc(bytes);
if (B == NULL){
printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
exit(1);
}
C = (double *) malloc(bytes);
if (C == NULL){
printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
exit(1);
}
// Initialize the arrays
#pragma omp parallel for
for (size_t i=0; i<length; i++) {
A[i] = 2.0;
B[i] = 2.0;
C[i] = 0.0;
}
// Perform the computation
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for \
map(to: A[0:length], B[0:length]) \
map(tofrom: C[0:length])
for (size_t i=0; i<length; i++) {
C[i] += A[i] + scalar * B[i];
}
}
nstream_time = omp_get_wtime() - nstream_time;
// Validate and output results
double ar = 2.0;
double br = 2.0;
double cr = 0.0;
for (int iter = 0; iter<iterations; iter++) {
for (int i=0; i<length; i++) {
cr += ar + scalar * br;
}
}
double asum = 0.0;
#pragma omp parallel for reduction(+:asum)
for (size_t i=0; i<length; i++) {
asum += fabs(C[i]);
}
free(A);
free(B);
free(C);
double epsilon=1.e-8;
if (fabs(cr - asum)/asum > epsilon) {
printf("Failed Validation on output array\n"
" Expected checksum: %lf\n"
" Observed checksum: %lf\n"
"ERROR: solution did not validate\n", cr, asum);
return 1;
} else {
printf("Solution validates\n");
double avgtime = nstream_time/iterations;
printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
}
return 0;
}
The map clauses on the target construct inside the iterations loop cause data (the values of A, B, and C) to be transferred from the host to the device at the beginning of each target region, and cause data (the values of C) to be transferred from the device to the host at the end of each target region. These data transfers incur a significant performance overhead.
A better approach when using map clauses is to put the whole iterations loop inside a target data construct that carries the map clauses. This causes the transfers to occur once at the beginning of the iterations loop and once at its end. The modified example using target data and map clauses is shown below.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>
#define iterations 1000
#define length 64*1024*1024
int main(int argc, char * argv[])
{
size_t bytes = length*sizeof(double);
double * __restrict A;
double * __restrict B;
double * __restrict C;
double scalar = 3.0;
double nstream_time = 0.0;
// Allocate arrays on the host using plain malloc()
A = (double *) malloc(bytes);
if (A == NULL){
printf(" ERROR: Cannot allocate space for A using plain malloc().\n");
exit(1);
}
B = (double *) malloc(bytes);
if (B == NULL){
printf(" ERROR: Cannot allocate space for B using plain malloc().\n");
exit(1);
}
C = (double *) malloc(bytes);
if (C == NULL){
printf(" ERROR: Cannot allocate space for C using plain malloc().\n");
exit(1);
}
// Initialize the arrays
#pragma omp parallel for
for (size_t i=0; i<length; i++) {
A[i] = 2.0;
B[i] = 2.0;
C[i] = 0.0;
}
// Perform the computation
nstream_time = omp_get_wtime();
#pragma omp target data map(to: A[0:length], B[0:length]) \
map(tofrom: C[0:length])
{
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for
for (size_t i=0; i<length; i++) {
C[i] += A[i] + scalar * B[i];
}
}
}
nstream_time = omp_get_wtime() - nstream_time;
// Validate and output results
double ar = 2.0;
double br = 2.0;
double cr = 0.0;
for (int iter = 0; iter<iterations; iter++) {
for (int i=0; i<length; i++) {
cr += ar + scalar * br;
}
}
double asum = 0.0;
#pragma omp parallel for reduction(+:asum)
for (size_t i=0; i<length; i++) {
asum += fabs(C[i]);
}
free(A);
free(B);
free(C);
double epsilon=1.e-8;
if (fabs(cr - asum)/asum > epsilon) {
printf("Failed Validation on output array\n"
" Expected checksum: %lf\n"
" Observed checksum: %lf\n"
"ERROR: solution did not validate\n", cr, asum);
return 1;
} else {
printf("Solution validates\n");
double avgtime = nstream_time/iterations;
printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
}
return 0;
}
omp_target_alloc
Next, the example above is modified to use device allocations instead of map clauses. Storage for arrays A, B, and C is allocated directly on the device by calling the OpenMP runtime routine omp_target_alloc. The routine takes two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc returns NULL, the allocation was not successful.
To access the allocated memory in a target construct, the device pointer returned by a call to omp_target_alloc is listed in an is_device_ptr clause on the target construct. This ensures that there is no data transfer before and after kernel execution, since the kernel operates on data that is already on the device.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C on the device.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>
#define iterations 1000
#define length 64*1024*1024
int main(int argc, char * argv[])
{
int device_id = omp_get_default_device();
size_t bytes = length*sizeof(double);
double * __restrict A;
double * __restrict B;
double * __restrict C;
double scalar = 3.0;
double nstream_time = 0.0;
// Allocate arrays in device memory
A = (double *) omp_target_alloc(bytes, device_id);
if (A == NULL){
printf(" ERROR: Cannot allocate space for A using omp_target_alloc().\n");
exit(1);
}
B = (double *) omp_target_alloc(bytes, device_id);
if (B == NULL){
printf(" ERROR: Cannot allocate space for B using omp_target_alloc().\n");
exit(1);
}
C = (double *) omp_target_alloc(bytes, device_id);
if (C == NULL){
printf(" ERROR: Cannot allocate space for C using omp_target_alloc().\n");
exit(1);
}
// Initialize the arrays
#pragma omp target teams distribute parallel for \
is_device_ptr(A,B,C)
for (size_t i=0; i<length; i++) {
A[i] = 2.0;
B[i] = 2.0;
C[i] = 0.0;
}
// Perform the computation
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for \
is_device_ptr(A,B,C)
for (size_t i=0; i<length; i++) {
C[i] += A[i] + scalar * B[i];
}
}
nstream_time = omp_get_wtime() - nstream_time;
// Validate and output results
double ar = 2.0;
double br = 2.0;
double cr = 0.0;
for (int iter = 0; iter<iterations; iter++) {
for (int i=0; i<length; i++) {
cr += ar + scalar * br;
}
}
double asum = 0.0;
#pragma omp target teams distribute parallel for reduction(+:asum) \
is_device_ptr(C)
for (size_t i=0; i<length; i++) {
asum += fabs(C[i]);
}
omp_target_free(A, device_id);
omp_target_free(B, device_id);
omp_target_free(C, device_id);
double epsilon=1.e-8;
if (fabs(cr - asum)/asum > epsilon) {
printf("Failed Validation on output array\n"
" Expected checksum: %lf\n"
" Observed checksum: %lf\n"
"ERROR: solution did not validate\n", cr, asum);
return 1;
} else {
printf("Solution validates\n");
double avgtime = nstream_time/iterations;
printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
}
return 0;
}
Notes
- When calling omp_target_alloc, the device number specified must be one of the supported devices, other than the host device. This is the device on which storage will be allocated.
- Since the arrays A, B, and C are not accessible from the host, the initialization of the arrays, the kernel execution, and the summation of the elements of C all need to be done inside OpenMP target regions.
- A device allocation can only be accessed by the device specified in the omp_target_alloc call, but it may be copied to memory allocated on the host or on other devices by calling omp_target_memcpy (shown in a later example).
omp_target_alloc_device
The Intel extension omp_target_alloc_device is similar to omp_target_alloc. It is also called with two arguments: the number of bytes to allocate on the device, and the number of the device on which to allocate the storage. The routine returns a device pointer that references the device address of the storage allocated on the device. If the call to omp_target_alloc_device returns NULL, the allocation was not successful.
The above omp_target_alloc example can be rewritten using omp_target_alloc_device by simply replacing each call to omp_target_alloc with a call to omp_target_alloc_device, as shown below.
At the end of the program, the runtime routine omp_target_free is again used to deallocate the storage for A, B, and C on the device.
 // Allocate arrays in device memory
A = (double *) omp_target_alloc_device(bytes, device_id);
if (A == NULL){
printf(" ERROR: Cannot allocate space for A using omp_target_alloc_device().\n");
exit(1);
}
B = (double *) omp_target_alloc_device(bytes, device_id);
if (B == NULL){
printf(" ERROR: Cannot allocate space for B using omp_target_alloc_device().\n");
exit(1);
}
C = (double *) omp_target_alloc_device(bytes, device_id);
if (C == NULL){
printf(" ERROR: Cannot allocate space for C using omp_target_alloc_device().\n");
exit(1);
}
Note: All of the above Notes that apply to omp_target_alloc also apply to omp_target_alloc_device.
omp_target_alloc_host
The above example can also be rewritten using a host allocation for A, B, and C. This makes the memory accessible to the host and all supported devices.
In the following modified example, the omp_target_alloc_host runtime routine (an Intel extension) is called to allocate storage for each of the arrays A, B, and C. The routine takes two arguments: the number of bytes to allocate, and a device number. The device number must be one of the supported devices, other than the host device. The routine returns a pointer to a storage location in host memory. If the call to omp_target_alloc_host returns NULL, the allocation was not successful.
Note that the directive requires unified_address is specified at the top of the program. This requires the implementation to guarantee that all devices accessible through OpenMP API routines and directives use a unified address space. In this address space, a pointer always refers to the same location in memory from all devices, and the is_device_ptr clause is not necessary to obtain device addresses from device pointers for use inside target regions. When using Intel compilers, the requires unified_address directive is actually not needed, since a unified address space is guaranteed by default. However, the code includes the directive for portability.
The pointer returned by a call to omp_target_alloc_host can be used to access the storage from the host and from all supported devices. No map clauses and no is_device_ptr clauses are needed on a target construct to access the memory from a device, since a unified address space is used.
At the end of the program, the runtime routine omp_target_free is used to deallocate the storage for A, B, and C.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>
#pragma omp requires unified_address
#define iterations 1000
#define length 64*1024*1024
int main(int argc, char * argv[])
{
int device_id = omp_get_default_device();
size_t bytes = length*sizeof(double);
double * __restrict A;
double * __restrict B;
double * __restrict C;
double scalar = 3.0;
double nstream_time = 0.0;
// Allocate arrays in host memory
A = (double *) omp_target_alloc_host(bytes, device_id);
if (A == NULL){
printf(" ERROR: Cannot allocate space for A using omp_target_alloc_host().\n");
exit(1);
}
B = (double *) omp_target_alloc_host(bytes, device_id);
if (B == NULL){
printf(" ERROR: Cannot allocate space for B using omp_target_alloc_host().\n");
exit(1);
}
C = (double *) omp_target_alloc_host(bytes, device_id);
if (C == NULL){
printf(" ERROR: Cannot allocate space for C using omp_target_alloc_host().\n");
exit(1);
}
// Initialize the arrays
#pragma omp parallel for
for (size_t i=0; i<length; i++) {
A[i] = 2.0;
B[i] = 2.0;
C[i] = 0.0;
}
// Perform the computation
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for
for (size_t i=0; i<length; i++) {
C[i] += A[i] + scalar * B[i];
}
}
nstream_time = omp_get_wtime() - nstream_time;
// Validate and output results
double ar = 2.0;
double br = 2.0;
double cr = 0.0;
for (int iter = 0; iter<iterations; iter++) {
for (int i=0; i<length; i++) {
cr += ar + scalar * br;
}
}
double asum = 0.0;
#pragma omp parallel for reduction(+:asum)
for (size_t i=0; i<length; i++) {
asum += fabs(C[i]);
}
omp_target_free(A, device_id);
omp_target_free(B, device_id);
omp_target_free(C, device_id);
double epsilon=1.e-8;
if (fabs(cr - asum)/asum > epsilon) {
printf("Failed Validation on output array\n"
" Expected checksum: %lf\n"
" Observed checksum: %lf\n"
"ERROR: solution did not validate\n", cr, asum);
return 1;
} else {
printf("Solution validates\n");
double avgtime = nstream_time/iterations;
printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
}
return 0;
}
Notes
- When calling omp_target_alloc_host, the device number specified must be one of the supported devices, other than the host device.
- Since the arrays A, B, and C are accessible from the host and the device, the initialization of the arrays and the summation of the elements of C may be done either on the host (outside of a target construct) or on the device (inside a target construct).
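The performance comparison later in this section also refers to an omp_target_alloc_shared version of the example. As a sketch (assuming this Intel extension takes the same two arguments as omp_target_alloc_host: the number of bytes to allocate and a device number), only the allocation calls change relative to the omp_target_alloc_host version:

```c
  // Allocate arrays in shared memory (Intel extension); the data may
  // migrate between system memory and device-local memory.
  A = (double *) omp_target_alloc_shared(bytes, device_id);
  if (A == NULL){
    printf(" ERROR: Cannot allocate space for A using omp_target_alloc_shared().\n");
    exit(1);
  }
  B = (double *) omp_target_alloc_shared(bytes, device_id);
  if (B == NULL){
    printf(" ERROR: Cannot allocate space for B using omp_target_alloc_shared().\n");
    exit(1);
  }
  C = (double *) omp_target_alloc_shared(bytes, device_id);
  if (C == NULL){
    printf(" ERROR: Cannot allocate space for C using omp_target_alloc_shared().\n");
    exit(1);
  }
```

As with the host-allocation version, no map or is_device_ptr clauses are needed on the target constructs, and the storage is released with omp_target_free at the end of the program.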
omp_target_memcpy
The following example shows how the runtime routine omp_target_memcpy may be used to copy memory from the host to the device, and from the device to the host. First, arrays h_A, h_B, and h_C are allocated in system memory using plain malloc and then initialized. Corresponding arrays d_A, d_B, and d_C are allocated on the device using omp_target_alloc.
Before the start of the target construct, the values in h_A, h_B, and h_C are copied to d_A, d_B, and d_C by calling omp_target_memcpy. After the target region, the new d_C values computed on the device are copied back to h_C by calling omp_target_memcpy. The routine takes the destination pointer, the source pointer, the number of bytes to copy, a destination offset, a source offset, the destination device number, and the source device number, and it returns zero on success.
// clang-format off
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <omp.h>
#define iterations 1000
#define length 64*1024*1024
int main(int argc, char * argv[])
{
int device_id = omp_get_default_device();
int host_id = omp_get_initial_device();
size_t bytes = length*sizeof(double);
double * __restrict h_A;
double * __restrict h_B;
double * __restrict h_C;
double * __restrict d_A;
double * __restrict d_B;
double * __restrict d_C;
double scalar = 3.0;
double nstream_time = 0.0;
// Allocate arrays h_A, h_B, and h_C on the host using plain malloc()
h_A = (double *) malloc(bytes);
if (h_A == NULL){
printf(" ERROR: Cannot allocate space for h_A using plain malloc().\n");
exit(1);
}
h_B = (double *) malloc(bytes);
if (h_B == NULL){
printf(" ERROR: Cannot allocate space for h_B using plain malloc().\n");
exit(1);
}
h_C = (double *) malloc(bytes);
if (h_C == NULL){
printf(" ERROR: Cannot allocate space for h_C using plain malloc().\n");
exit(1);
}
// Allocate arrays d_A, d_B, and d_C on the device using omp_target_alloc()
d_A = (double *) omp_target_alloc(bytes, device_id);
if (d_A == NULL){
printf(" ERROR: Cannot allocate space for d_A using omp_target_alloc().\n");
exit(1);
}
d_B = (double *) omp_target_alloc(bytes, device_id);
if (d_B == NULL){
printf(" ERROR: Cannot allocate space for d_B using omp_target_alloc().\n");
exit(1);
}
d_C = (double *) omp_target_alloc(bytes, device_id);
if (d_C == NULL){
printf(" ERROR: Cannot allocate space for d_C using omp_target_alloc().\n");
exit(1);
}
// Initialize the arrays on the host
#pragma omp parallel for
for (size_t i=0; i<length; i++) {
h_A[i] = 2.0;
h_B[i] = 2.0;
h_C[i] = 0.0;
}
// Call omp_target_memcpy() to copy values from host to device
int rc = 0;
rc = omp_target_memcpy(d_A, h_A, bytes, 0, 0, device_id, host_id);
if (rc) {
printf("ERROR: omp_target_memcpy(A) returned %d\n", rc);
exit(1);
}
rc = omp_target_memcpy(d_B, h_B, bytes, 0, 0, device_id, host_id);
if (rc) {
printf("ERROR: omp_target_memcpy(B) returned %d\n", rc);
exit(1);
}
rc = omp_target_memcpy(d_C, h_C, bytes, 0, 0, device_id, host_id);
if (rc) {
printf("ERROR: omp_target_memcpy(C) returned %d\n", rc);
exit(1);
}
// Perform the computation
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for \
is_device_ptr(d_A,d_B,d_C)
for (size_t i=0; i<length; i++) {
d_C[i] += d_A[i] + scalar * d_B[i];
}
}
nstream_time = omp_get_wtime() - nstream_time;
// Call omp_target_memcpy() to copy values from device to host
rc = omp_target_memcpy(h_C, d_C, bytes, 0, 0, host_id, device_id);
if (rc) {
printf("ERROR: omp_target_memcpy(A) returned %d\n", rc);
exit(1);
}
// Validate and output results
double ar = 2.0;
double br = 2.0;
double cr = 0.0;
for (int iter = 0; iter<iterations; iter++) {
for (int i=0; i<length; i++) {
cr += ar + scalar * br;
}
}
double asum = 0.0;
#pragma omp parallel for reduction(+:asum)
for (size_t i=0; i<length; i++) {
asum += fabs(h_C[i]);
}
free(h_A);
free(h_B);
free(h_C);
omp_target_free(d_A, device_id);
omp_target_free(d_B, device_id);
omp_target_free(d_C, device_id);
double epsilon=1.e-8;
if (fabs(cr - asum)/asum > epsilon) {
printf("Failed Validation on output array\n"
" Expected checksum: %lf\n"
" Observed checksum: %lf\n"
"ERROR: solution did not validate\n", cr, asum);
return 1;
} else {
printf("Solution validates\n");
double avgtime = nstream_time/iterations;
printf("Checksum = %lf; Avg time (s): %lf\n", asum, avgtime);
}
return 0;
}
Performance Considerations
In the above examples (using the map clause, omp_target_alloc, omp_target_alloc_device, omp_target_alloc_host, omp_target_alloc_shared, and omp_target_memcpy), the main kernel is the target construct that computes the values of array C. To get more accurate timings, this target construct is enclosed in a loop, so the offload happens iterations number of times (where iterations = 1000). The average kernel time is computed by dividing the total time taken by the iterations loop by 1000.
 // Perform the computation
nstream_time = omp_get_wtime();
for (int iter = 0; iter<iterations; iter++) {
#pragma omp target teams distribute parallel for \
is_device_ptr(A,B,C)
for (size_t i=0; i<length; i++) {
C[i] += A[i] + scalar * B[i];
}
}
nstream_time = omp_get_wtime() - nstream_time;
LIBOMPTARGET_DEBUG=1 output shows that all of the above examples have the same ND-range partitioning:
Libomptarget --> Launching target execution __omp_offloading_811_6e21ac8__Z4main_l55 with pointer 0x0000000001fc20c8 (index=1).
Libomptarget --> Manifesting used target pointers:
Target LEVEL0 RTL --> Executing a kernel 0x0000000001fc20c8...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 67108863, Stride = 1
Target LEVEL0 RTL --> Group sizes = {64, 1, 1}
Target LEVEL0 RTL --> Group counts = {1048576, 1, 1}
The following table shows the average times taken by the kernel in the various
versions when running on the particular ATS GPU used (1-tile only).
| Version | Time (seconds) |
|---|---|
| map | 0.139208 |
| map + target data | 0.008491 |
| omp_target_alloc | 0.007247 |
| omp_target_alloc_device | 0.007484 |
| omp_target_alloc_host | 0.088674 |
| omp_target_alloc_shared | 0.008319 |
| omp_target_memcpy | 0.008284 |
The above performance numbers show that the map version is the slowest (0.139208 seconds). This is because of the data transfers that occur at the beginning and end of each kernel launch. The main kernel is launched 1000 times. At the beginning of each kernel launch, storage for arrays A, B, and C is allocated on the device, and the values of these arrays are copied from the host to the device. At the end of the kernel, the values of array C are copied from the device to the host. Putting the whole iterations loop inside a target data construct with map clauses reduced the runtime to 0.008491 seconds, because the transfers occur once before the first kernel in the iterations loop and once after the last kernel in that loop.
The omp_target_alloc and omp_target_alloc_device versions have the best performance (0.007247 and 0.007484 seconds, respectively). In these versions, storage for A, B, and C is allocated directly in device memory, so accesses on the device are satisfied from device-local memory.
The omp_target_alloc_shared version also performs well, but is somewhat slower (0.008319 seconds). In this version, storage for A, B, and C is allocated in shared memory, so the data can migrate between the host and devices. There is the overhead of migration but, after migration, accesses on a device are satisfied from much faster device-local memory.
The omp_target_alloc_host version takes almost 11x more time than the omp_target_alloc_shared version (0.088674 seconds). This is because data allocated in host memory does not migrate from the host to the device. Rather, the data is typically sent over a bus, such as PCI Express, that connects the device to the host. Generally, data should be allocated in host memory using omp_target_alloc_host only if a small amount of that data will be accessed on the device.
Finally, a note regarding data transfers: The amount of data transferred in the map version can be seen in LIBOMPTARGET_DEBUG=1 output by grepping for "Libomptarget --> Moving". Notice that each launch of the main kernel yields the following data transfers:
$ grep "Libomptarget --> Moving" test_target_map.debug
Libomptarget --> Moving 536870912 bytes (hst:0x00001477908f6010) -> (tgt:0xffffd556aaa00000)
Libomptarget --> Moving 536870912 bytes (hst:0x00001477d08f8010) -> (tgt:0xffffd556caa00000)
Libomptarget --> Moving 536870912 bytes (hst:0x00001477b08f7010) -> (tgt:0xffffd556eaa00000)
Libomptarget --> Moving 536870912 bytes (tgt:0xffffd556aaa00000) -> (hst:0x00001477908f6010)
On the other hand, data transfers in the omp_target_alloc_... versions are handled by a lower layer of the runtime system, so grepping for "Libomptarget --> Moving" in LIBOMPTARGET_DEBUG=1 output for these versions will not show the data transfers that took place.
Fortran
The Fortran version of the example using target data and map clauses is shown below.
program main
use iso_fortran_env
use omp_lib
implicit none
integer, parameter :: iterations=1000
integer, parameter :: length=64*1024*1024
real(kind=REAL64), parameter :: epsilon=1.D-8
real(kind=REAL64), allocatable :: A(:)
real(kind=REAL64), allocatable :: B(:)
real(kind=REAL64), allocatable :: C(:)
real(kind=REAL64) :: scalar=3.0
real(kind=REAL64) :: ar, br, cr, asum
real(kind=REAL64) :: nstream_time, avgtime
integer :: err, i, iter
!
! Allocate arrays on the host using plain allocate
allocate( A(length), stat=err )
if (err .ne. 0) then
print *, "Allocation of A returned ", err
stop 1
endif
allocate( B(length), stat=err )
if (err .ne. 0) then
print *, "Allocation of B returned ", err
stop 1
endif
allocate( C(length), stat=err )
if (err .ne. 0) then
print *, "Allocation of C returned ", err
stop 1
endif
!
! Initialize the arrays
!$omp parallel do
do i = 1, length
A(i) = 2.0
B(i) = 2.0
C(i) = 0.0
end do
!
! Perform the computation
nstream_time = omp_get_wtime()
!$omp target data map(to: A, B) map(tofrom: C)
do iter = 1, iterations
!$omp target teams distribute parallel do
do i = 1, length
C(i) = C(i) + A(i) + scalar * B(i)
end do
end do
!$omp end target data
nstream_time = omp_get_wtime() - nstream_time
!
! Validate and output results
ar = 2.0
br = 2.0
cr = 0.0
do iter = 1, iterations
do i = 1, length
cr = cr + ar + scalar * br
end do
end do
asum = 0.0
!$omp parallel do reduction(+:asum)
do i = 1, length
asum = asum + abs(C(i))
end do
if (abs(cr - asum)/asum > epsilon) then
print *, "Failed Validation on output array: ", "Expected =", cr, "Observed =", asum
else
avgtime = nstream_time/iterations
print *, "Solution validates: ", "Checksum =", asum, "Avg time (s) =", avgtime
endif
deallocate(A)
deallocate(B)
deallocate(C)
end program main
The Fortran version of the example using omp_target_alloc_device is shown below. In this example, allocate directives with the allocator omp_target_device_mem_alloc are used to allocate arrays A, B, and C on the device. The use_device_addr(A, B, C) clause is used on the target data directive to indicate that the arrays have device addresses, and these addresses should be used in the target region.
program main
use iso_fortran_env
use omp_lib
implicit none
integer, parameter :: iterations=1000
integer, parameter :: length=64*1024*1024
real(kind=REAL64), parameter :: epsilon=1.D-8
real(kind=REAL64), allocatable :: A(:)
real(kind=REAL64), allocatable :: B(:)
real(kind=REAL64), allocatable :: C(:)
real(kind=REAL64) :: scalar=3.0
real(kind=REAL64) :: ar, br, cr, asum
real(kind=REAL64) :: nstream_time, avgtime
integer :: i, iter
!
! Allocate arrays in device memory
!$omp allocate allocator(omp_target_device_mem_alloc)
allocate(A(length))
!$omp allocate allocator(omp_target_device_mem_alloc)
allocate(B(length))
!$omp allocate allocator(omp_target_device_mem_alloc)
allocate(C(length))
!
! Begin target data
!$omp target data use_device_addr(A, B, C)
!
! Initialize the arrays
!$omp target teams distribute parallel do
do i = 1, length
A(i) = 2.0
B(i) = 2.0
C(i) = 0.0
end do
!
! Perform the computation
nstream_time = omp_get_wtime()
do iter = 1, iterations
!$omp target teams distribute parallel do
do i = 1, length
C(i) = C(i) + A(i) + scalar * B(i)
end do
end do
nstream_time = omp_get_wtime() - nstream_time
!
! Validate and output results
ar = 2.0
br = 2.0
cr = 0.0
do iter = 1, iterations
do i = 1, length
cr = cr + ar + scalar * br
end do
end do
asum = 0.0
!$omp target teams distribute parallel do reduction(+:asum)
do i = 1, length
asum = asum + abs(C(i))
end do
!
! End target data
!$omp end target data
if (abs(cr - asum)/asum > epsilon) then
print *, "Failed Validation on output array:", "Expected =", cr, "Observed =", asum
else
avgtime = nstream_time/iterations
print *, "Solution validates:", "Checksum =", asum, "Avg time (s) =", avgtime
endif
deallocate(A)
deallocate(B)
deallocate(C)
end program main