Offload performance optimization basically boils down to three tasks:

1. Minimize the number and size of data transfers to and from the device while maximizing execution time of the kernel on the device.
2. When possible, overlap data transfers to/from the device with computation on the device.
3. Maximize the performance of the kernel on the device.
While it is possible to take explicit control of data transfers in both OpenMP* offload and SYCL*, you also can allow this to happen automatically. In addition, because the host and offload device operate mostly asynchronously, even if you try to take control over data transfers, the transfers may not happen in the expected order, and may take longer than anticipated. When data used by both the device and the host is stored in unified shared memory (USM), there is another transparent layer of data transfers happening that also can affect performance.
Buffer Transfer Time vs Execution Time
Transferring any data to or from an offload device is relatively expensive, requiring memory allocations in user space, system calls, and interfacing with hardware controllers. Unified shared memory (USM) adds to these costs by requiring that some background process keeps memory being modified on either the host or offload device in sync. Furthermore, kernels on the offload device must wait to run until all the input or output buffers they need to run are set up and ready to use.
All this overhead is roughly the same no matter how much information you need to transfer to or from the offload device in a single data transfer. Thus, it is much more efficient to transfer 10 numbers in bulk rather than one at a time. Still, every data transfer is expensive, so minimizing the total number of transfers is also very important. If, for example, you have some constants that are needed by multiple kernels, or during multiple invocations of the same kernel, transfer them to the offload device once and reuse them, rather than sending them with every kernel invocation. Finally, as might be expected, single large data transfers take more time than single small data transfers.
The number and size of buffers sent is only part of the equation. Once the data is at the offload device, consider how long the resulting kernel executes. If it runs for less time than it takes to transfer the data to the offload device, it may not be worthwhile to offload the data in the first place unless the time to do the same operation on the host is longer than the combined kernel execution and data transfer time.
Finally, consider how long the offload device is idle between the execution of one kernel and the next. A long wait could be due to data transfer, or it could simply reflect the nature of the algorithm running on the host. If it is the former, it may be worthwhile to overlap data transfer and kernel execution, if possible.
In short, the interplay between code executing on the host, code executing on the offload device, and data transfer is quite complex. The order and timing of these operations is not something you can work out through intuition, even for the simplest code. You need tools like those listed below to get a visual representation of these activities, and then use that information to optimize your offload code.
Intel® VTune™ Profiler
In addition to giving you detailed performance information on the host, VTune can also provide detailed information about performance on a connected GPU. Setup information for GPUs is available from the Intel VTune Profiler User Guide.
Intel VTune Profiler’s GPU Offload view gives you an overview of the hotspots on the GPU, including the amount of time spent for data transfer to and from each kernel. The GPU Compute/Media Hotspots view allows you to dive more deeply into what is happening to your kernels on the GPU, such as by using the Dynamic Instruction Count to view a micro analysis of the GPU kernel performance. With these profiling modes, you can observe how data transfer and compute occur over time, determine if there is enough work for a kernel to run effectively, learn how your kernels use the GPU memory hierarchy, and so on.
Additional details about these analysis types are available from the Intel VTune Profiler User Guide. A detailed look at optimizing for GPU using VTune Profiler is available from the Optimize Applications for Intel GPUs with Intel VTune Profiler page.
You can also use the Intel VTune Profiler command-line interface to capture kernel execution time with lightweight profiling runs.
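For example, assuming the `vtune` command is on your PATH (`./myapp` is a placeholder for your application; analysis-type names may vary between VTune versions):

```shell
# Overview of host/device activity, hotspots, and data-transfer time
vtune -collect gpu-offload -r offload_result -- ./myapp

# Deeper look at kernel behavior on the GPU
vtune -collect gpu-hotspots -r hotspots_result -- ./myapp

# Print a summary of a collected result
vtune -report summary -r offload_result
```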
Intel® Advisor

Intel® Advisor provides two features that can help you improve performance when offloading computation to a GPU:
Offload Modeling can watch your host OpenMP* program and recommend which parts of it could be profitably offloaded to the GPU. It also lets you model a variety of different target GPUs, so that you can learn whether offload will be profitable on some devices but not others. Offload Modeling also gives detailed information on what factors may bound offload performance.
GPU Roofline analysis can watch your application when it runs on the GPU, and graphically show how well each kernel is making use of the memory subsystem and compute units on the GPU. This can let you know how well your kernel is optimized for the GPU.
To run these modes on an application that already does some offload, you need to set up your environment to use the OpenCL™ device on the CPU for analysis. Instructions are available from the Intel Advisor User Guide.
Offload Modeling does not require that you have already modified your application to use a GPU; it can work entirely on host code.
Offload API call Timelines
If you do not want to use Intel® VTune™ Profiler to understand when data is being copied to the GPU, and when kernels run, onetrace, ze_tracer, cl_tracer, and the Intercept Layer for OpenCL™ Applications give you a way to observe this information (although, if you want a graphical timeline, you'll need to write a script to visualize the output). For more information, see oneAPI Debug Tools, Trace the Offload Process, and Debug the Offload Process.