1. Pursue general optimizations that apply across accelerators.
2. Optimize aggressively for the prioritized accelerators.
3. Optimize the host code in conjunction with steps 1 and 2.
- High-level optimizations
- Loop-related optimizations
- Memory-related optimizations
- DPC++-specific optimizations
High-level Optimization Tips
- Increase the amount of parallel work. Provide more work than the number of processing elements so that the processing elements can be more fully utilized.
- Minimize the code size of kernels. This helps keep the kernels in the instruction cache of the accelerator, if the accelerator contains one.
- Load balance kernels. Avoid significantly different execution times between kernels as the long-running kernels may become bottlenecks and affect the throughput of the other kernels.
- Avoid expensive functions. Functions with high execution times may become bottlenecks.
Loop-related Optimization Tips
- Prefer well-structured, well-formed loops with simple exit conditions: a single exit point and a single condition that compares against an integer bound.
- Prefer loops with linear indexes and constant bounds – these are loops that employ an integer index into an array, for example, and have bounds that are known at compile-time.
- Declare variables in the deepest scope possible. Doing so can help reduce memory or stack usage.
- Minimize or relax loop-carried data dependencies. Loop-carried dependencies can limit parallelization. Remove dependencies if possible. If not, pursue techniques to maximize the distance between the dependency and/or keep the dependency in local memory.
- Unroll loops with `#pragma unroll`.
Memory-related Optimization Tips
- When possible, favor greater computation over greater memory use. Memory latency and bandwidth, compared to computation, can become a bottleneck.
- When possible, favor greater local and private memory use over global memory use.
- Avoid pointer aliasing.
- Coalesce memory accesses. Grouping memory accesses helps limit the number of individual memory requests and increases utilization of individual cache lines.
- When possible, store variables and arrays in private memory in frequently executed regions of code.
- Beware of loop unrolling effects on concurrent memory accesses.
- Avoid a write to a global that another kernel reads. Use a pipe instead.
DPC++-specific Optimization Tips
- Consider applying the `[[intel::kernel_args_restrict]]` attribute to a kernel. The attribute allows the compiler to ignore dependencies between accessor arguments in the DPC++ kernel. Ignoring these dependencies lets the compiler perform more aggressive optimizations and potentially improve the performance of the kernel.
- When possible, specify a work-group size. The `[[cl::reqd_work_group_size(X, Y, Z)]]` attribute, where X, Y, and Z are integer dimensions of the ND-range, can be employed to set the work-group size. The compiler can take advantage of this information to optimize more aggressively.
- Consider using the `-Xsfp-relaxed` option when possible. This option relaxes the order of arithmetic floating-point operations.
- Consider using the `-Xsfpc` option when possible. This option removes intermediary floating-point rounding operations and conversions whenever possible, and carries additional bits to maintain precision.
- Consider using the `-Xsno-accessor-aliasing` option. This option ignores dependencies between accessor arguments in a SYCL* kernel.
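As a sketch of how the `-Xs` options above reach the backend, the compile command below passes them on the DPC++ compiler command line. The source and output file names are placeholders, and the exact driver invocation (e.g. any device-selection flags) depends on your toolchain version; only the three `-Xs` options are taken from the tips above.

```shell
# Illustrative only: pass the floating-point and accessor-aliasing
# options through to the device backend when compiling a SYCL kernel.
# kernel.cpp and kernel are placeholder file names.
dpcpp -Xsfp-relaxed -Xsfpc -Xsno-accessor-aliasing kernel.cpp -o kernel
```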