Developer Reference

GPU Buffers Support

Short Description
This feature enables handling of device buffers in MPI functions such as MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce, and so on by using the Level Zero* library specified in the I_MPI_OFFLOAD_LEVEL_ZERO_LIBRARY variable.
To pass a pointer to an offloaded memory region to MPI, you may need to use specific compiler directives or obtain it from the corresponding acceleration runtime API. For example, use_device_ptr and use_device_addr are useful keywords to obtain device pointers in the OpenMP environment, as shown in the following code sample:
/* Copy data from host to device */
#pragma omp target data map(to: rank, values[0:num_values]) use_device_ptr(values)
{
    /* Compute something on device */
    #pragma omp target parallel for is_device_ptr(values)
    for (unsigned i = 0; i < num_values; ++i) {
        values[i] *= (rank + 1);
    }

    /* Send device buffer to another rank */
    MPI_Send(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);
}
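The same device pointer mechanism works on the receiving side. The following is a minimal, illustrative counterpart to the sample above; src_rank is a placeholder name, and the other names follow the earlier sample:

/* Receive directly into a device buffer, then copy back to the host */
#pragma omp target data map(from: values[0:num_values]) use_device_ptr(values)
{
    /* Receive into the device buffer from the sending rank */
    MPI_Recv(values, num_values, MPI_INT, src_rank, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}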
To achieve the best performance, use the same GPU buffer in MPI communications whenever possible. This lets the Intel® MPI Library cache the structures needed to communicate with the device and reuse them in subsequent iterations.
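For example, a typical iterative pattern keeps one device mapping alive for the whole run, so the same device pointer is passed to MPI on every iteration. This is a minimal sketch of that pattern; num_iters is a placeholder, and the other names follow the earlier sample:

/* Keep one device mapping alive across iterations so the library can
   reuse its cached structures for the same device buffer */
#pragma omp target data map(tofrom: values[0:num_values]) use_device_ptr(values)
{
    for (int iter = 0; iter < num_iters; ++iter) {
        /* Update the data on the device */
        #pragma omp target parallel for is_device_ptr(values)
        for (unsigned i = 0; i < num_values; ++i) {
            values[i] += 1;
        }

        /* Same device pointer every iteration: cached structures are reused */
        MPI_Send(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);
    }
}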
Set I_MPI_OFFLOAD=0 to disable this feature if you do not provide device buffers to MPI primitives, since handling of device buffers can affect performance.

I_MPI_OFFLOAD_MEMCPY

Set this environment variable to select the GPU memcpy kind.
Syntax
I_MPI_OFFLOAD_MEMCPY=<value>
Arguments
<value>
Description
cached
  Cache objects created for communication with the GPU so that they can be reused if the same device buffer is later provided to the MPI function. Default value.
blocked
  Copy the device buffer to the host and wait for the copy to complete inside the MPI function.
nonblocked
  Copy the device buffer to the host and do not wait for the copy to complete inside the MPI function. Wait for the operation completion in MPI_Wait.
Description
Set this environment variable to select the GPU memcpy kind. The best-performing option is chosen by default. Nonblocked memcpy can be used with MPI non-blocking point-to-point operations to overlap the device-to-host copy with the compute part, as shown in the sketch below. Blocked memcpy can be used if the other kinds are not stable.
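For illustration, the following sketch shows the overlap pattern that nonblocked memcpy targets. It assumes I_MPI_OFFLOAD_MEMCPY=nonblocked is set in the environment and that values is a device pointer obtained as in the earlier sample; dest_rank, tag, and do_host_work are placeholder names:

MPI_Request request;

/* Non-blocking send from the device buffer; with
   I_MPI_OFFLOAD_MEMCPY=nonblocked the device-to-host copy is started
   but not awaited inside MPI_Isend */
MPI_Isend(values, num_values, MPI_INT, dest_rank, tag, MPI_COMM_WORLD, &request);

/* Overlap: independent host-side work proceeds while the copy and send run */
do_host_work();

/* Both the device-to-host copy and the send complete here */
MPI_Wait(&request, MPI_STATUS_IGNORE);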

I_MPI_OFFLOAD_PIPELINE

Set this environment variable to enable the pipeline algorithm.
Syntax
I_MPI_OFFLOAD_PIPELINE=<value>
Arguments
<value>
Description
1
  Enable the pipeline algorithm. Default value.
0
  Disable the pipeline algorithm.
Description
Set this environment variable to enable the pipeline algorithm, which can improve performance for large message sizes. The algorithm splits the user buffer into several segments, copies the segments to the host, and sends them to another rank.
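Conceptually, copying one segment from the device can then overlap with sending the previously copied segment, which hides part of the transfer cost for large messages. (This is a conceptual description of the overlap; the segment size and scheduling are internal to the library.)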

I_MPI_OFFLOAD_PIPELINE_THRESHOLD

Set this environment variable to control the threshold for the pipeline algorithm.
Syntax
I_MPI_OFFLOAD_PIPELINE_THRESHOLD=<value>
Arguments
<value>
Description
>0
  Threshold in bytes. The default value is 524288.
Description
This variable controls the message size starting from which the pipeline algorithm is used.
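For example, with the default value, a 1 MB message is sent with the pipeline algorithm, while a 256 KB message is not. To change the threshold, set the variable before launching the application, for example (illustrative invocation): I_MPI_OFFLOAD_PIPELINE_THRESHOLD=1048576 mpirun -n 2 ./myapp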

Product and Performance Information

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.