oneAPI enables functional code that can execute on multiple accelerators; however, that code may not perform optimally on every accelerator. A three-step optimization strategy is recommended to meet performance needs:
1. Pursue general optimizations that apply across accelerators.
2. Optimize aggressively for the prioritized accelerators.
3. Optimize the host code in conjunction with steps 1 and 2.
Optimization is a process of eliminating bottlenecks, that is, the sections of code that take more execution time than other sections. These sections may execute on the devices or on the host. During optimization, use a profiling tool such as Intel® VTune™ Profiler to find these bottlenecks.
This section discusses the first step of the strategy: pursue general optimizations that apply across accelerators. Device-specific optimizations and best practices (step 2) and optimizations between the host and devices (step 3) are detailed in device-specific optimization guides, such as the FPGA Optimization Guide for Intel® oneAPI Toolkits. This section assumes that the kernel to offload to the accelerator has already been determined and that the work will run on one accelerator. This guide does not address the division of work between host and accelerator, or between the host and multiple or different accelerators.
General optimizations that apply across accelerators can be classified into four categories: high-level, loop-related, memory-related, and SYCL-specific optimizations.
The following sections summarize these optimizations only; specific details on how to code most of these optimizations can be found online or in commonly available code optimization literature. More detail is provided for the SYCL-specific optimizations.
High-level Optimization Tips
Increase the amount of parallel work. Provide more work than there are processing elements so that the processing elements stay more fully utilized.
Minimize the code size of kernels. This helps keep the kernels in the instruction cache of the accelerator, if the accelerator contains one.
Load balance kernels. Avoid significantly different execution times between kernels as the long-running kernels may become bottlenecks and affect the throughput of the other kernels.
Avoid expensive functions. Avoid calling functions that have high execution times as they may become bottlenecks.
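As a minimal sketch of the "avoid expensive functions" tip, the host-side C++ below moves a square root and a division out of a loop so that the loop body only multiplies. The function names are hypothetical, the input is assumed to be a non-zero vector, and the same idea is assumed to carry over to the body of an offloaded kernel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum of the squares of all elements.
// Uses x * x rather than a general-purpose power function.
float sum_of_squares(const std::vector<float>& v) {
  float s = 0.0f;
  for (float x : v) s += x * x;
  return s;
}

// Naive version: a square root and a division are evaluated per element.
std::vector<float> normalize_naive(const std::vector<float>& v) {
  const float ss = sum_of_squares(v);
  std::vector<float> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    out[i] = v[i] / std::sqrt(ss);
  }
  return out;
}

// Hoisted version: the expensive operations run once, outside the loop.
std::vector<float> normalize_hoisted(const std::vector<float>& v) {
  const float inv_norm = 1.0f / std::sqrt(sum_of_squares(v));
  std::vector<float> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    out[i] = v[i] * inv_norm;  // one multiply per element
  }
  return out;
}
```

Multiplying by a precomputed reciprocal can differ from a direct division in the last bit, so compare the two versions with a tolerance rather than for exact equality.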
Loop-related Optimization Tips
Prefer well-structured, well-formed loops with simple exit conditions: loops that have a single exit and a single condition that compares against an integer bound.
Prefer loops with linear indexes and constant bounds – these are loops that employ an integer index into an array, for example, and have bounds that are known at compile-time.
Declare variables in the deepest scope possible. Doing so can help reduce memory or stack usage.
Minimize or relax loop-carried data dependencies. Loop-carried dependencies can limit parallelization. Remove dependencies if possible. If not, pursue techniques to maximize the distance between the dependency and/or keep the dependency in local memory.
Unroll loops with #pragma unroll. Unrolling exposes more work per iteration to the compiler, at the cost of larger code.
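One way to relax a loop-carried dependency can be sketched in host-side C++ as follows. The baseline sum makes every iteration wait for the previous one; the second version spreads the accumulation over several independent partial sums, which increases the distance between dependent operations. The names `sum_serial`, `sum_relaxed`, and `kLanes` are hypothetical, and the lane count of 4 is an arbitrary choice for illustration, not a recommended value.

```cpp
#include <cstddef>
#include <vector>

// Baseline: each addition depends on the sum from the previous iteration.
float sum_serial(const std::vector<float>& data) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < data.size(); ++i) {
    sum += data[i];
  }
  return sum;
}

// Illustrative number of independent accumulators (an assumption).
constexpr std::size_t kLanes = 4;

// Relaxed: each partial sum is updated only once every kLanes elements,
// so consecutive additions no longer depend on each other.
float sum_relaxed(const std::vector<float>& data) {
  float partial[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
  const std::size_t n = data.size();
  const std::size_t main_end = n - (n % kLanes);

  for (std::size_t i = 0; i < main_end; i += kLanes) {
    // Inner loop has a linear index and a compile-time constant bound,
    // which makes it a natural candidate for full unrolling.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      partial[lane] += data[i + lane];
    }
  }
  // Remaining elements when the size is not a multiple of kLanes.
  for (std::size_t i = main_end; i < n; ++i) {
    partial[0] += data[i];
  }

  float sum = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    sum += partial[lane];
  }
  return sum;
}
```

Because floating-point addition is not associative, the relaxed version can round differently from the serial one for general inputs; the two agree exactly only when every intermediate sum is exactly representable.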
Memory-related Optimization Tips
When possible, favor greater computation over greater memory use. Memory latency and bandwidth, rather than computation, can become the bottleneck.
When possible, favor greater local and private memory use over global memory use.
Avoid pointer aliasing.
Coalesce memory accesses. Grouping memory accesses helps limit the number of individual memory requests and increases utilization of individual cache lines.
When possible, store variables and arrays in private memory for high-execution areas of code.
Beware of loop unrolling effects on concurrent memory accesses.
Avoid writing to global memory that another kernel reads. Use a pipe instead.
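The preference for private over global memory can be illustrated with a host-side C++ sketch. Here the raw pointers `in` and `out` stand in for buffers that would live in global memory on a device, and the local variable `acc` stands in for private memory; this mapping, and the function names, are assumptions made for illustration. The callers are assumed to pass `in` with `rows * cols` elements and `out` with `rows` elements.

```cpp
#include <cstddef>

// Accumulates directly in the output buffer: every inner iteration reads
// and writes out[r], and the compiler must assume out may alias in.
void row_sums_through_output(const float* in, std::size_t rows,
                             std::size_t cols, float* out) {
  for (std::size_t r = 0; r < rows; ++r) {
    out[r] = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      out[r] += in[r * cols + c];
    }
  }
}

// Accumulates in a local variable and writes the output once per row.
// The inner loop also walks consecutive addresses of in (row-major order).
void row_sums_private(const float* in, std::size_t rows,
                      std::size_t cols, float* out) {
  for (std::size_t r = 0; r < rows; ++r) {
    float acc = 0.0f;  // declared in the deepest scope that needs it
    for (std::size_t c = 0; c < cols; ++c) {
      acc += in[r * cols + c];
    }
    out[r] = acc;  // single write to the output buffer
  }
}
```

Both functions compute the same result; the second simply reduces traffic to the output buffer from one read and one write per element to one write per row.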
SYCL-specific Optimization Tips
Consider applying the [[intel::kernel_args_restrict]] attribute to a kernel. The attribute allows the compiler to ignore dependencies between accessor arguments in the kernel, which lets it optimize more aggressively and can improve the performance of the kernel.
When possible, specify a work-group size. The attribute [[cl::reqd_work_group_size(X, Y, Z)]], where X, Y, and Z are the integer work-group sizes in each dimension of the ND-range, sets the work-group size. The compiler can take advantage of this information to optimize more aggressively.
Consider using the -Xsfp-relaxed option when possible. This option relaxes the order of arithmetic floating-point operations.
Consider using the -Xsfpc option when possible. This option removes intermediary floating-point rounding operations and conversions whenever possible and carries additional bits to maintain precision.
Consider using the -Xsno-accessor-aliasing option. This option tells the compiler to ignore dependencies between accessor arguments in a SYCL* kernel.
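To see why relaxing the order of floating-point operations is a trade-off rather than a free win, consider the small host-side C++ sketch below. It does not invoke any of the compiler options above; it only shows that regrouping the same three operands can change the result, which is the kind of difference a relaxed ordering may introduce. The function names are hypothetical, and the behavior described assumes IEEE 754 single-precision evaluation without fast-math style flags.

```cpp
// Adds three floats in source order: (a + b) + c.
float sum_left_to_right(float a, float b, float c) {
  return (a + b) + c;
}

// Adds the same operands with a different grouping: a + (b + c).
float sum_regrouped(float a, float b, float c) {
  return a + (b + c);
}
```

With the inputs 1.0e8f, -1.0e8f, and 1.0f, the first grouping cancels the two large values before adding the small one, while the second grouping first adds 1.0f to -1.0e8f, where it is lost to rounding. Whether such differences are acceptable depends on the accuracy requirements of the application.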