Programming Guide



oneAPI enables functionally correct code to execute on multiple accelerators; however, that code may not be optimal on every accelerator. A three-step optimization strategy is recommended to meet performance needs:
  1. Pursue general optimizations that apply across accelerators.
  2. Optimize aggressively for the prioritized accelerators.
  3. Optimize the host code in conjunction with steps 1 and 2.
Optimization is a process of eliminating bottlenecks, that is, the sections of code that take more execution time than other sections. These sections may execute on the devices or on the host. During optimization, use a profiling tool such as Intel® VTune™ Profiler to find these bottlenecks.
This section discusses the first step of the strategy: pursuing general optimizations that apply across accelerators. Device-specific optimizations and best practices (step 2) and optimizations between the host and devices (step 3) are detailed in device-specific optimization guides, such as the FPGA Optimization Guide for Intel® oneAPI Toolkits.
This section assumes that the kernel to offload to the accelerator has already been determined and that the work will be performed on one accelerator. It does not address the division of work between host and accelerator, or between the host and multiple, possibly different, accelerators.
General optimizations that apply across accelerators can be classified into four categories:
  1. High-level optimizations
  2. Loop-related optimizations
  3. Memory-related optimizations
  4. SYCL-specific optimizations
The following sections summarize these optimizations only; specific details on how to code most of these optimizations can be found online or in commonly available code optimization literature. More detail is provided for the SYCL-specific optimizations.

High-level Optimization Tips

  • Increase the amount of parallel work. Supplying more work than there are processing elements helps keep the processing elements fully utilized.
  • Minimize the code size of kernels. This helps keep the kernels in the instruction cache of the accelerator, if the accelerator contains one.
  • Load-balance kernels. Avoid significantly different execution times between kernels, because long-running kernels may become bottlenecks and limit the throughput of the other kernels.
  • Avoid expensive functions. Avoid calling functions that have high execution times as they may become bottlenecks.
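
As one illustration of the last tip, a general-purpose call such as std::pow can often be replaced with cheaper arithmetic when the exponent is a small, known integer. The sketch below is plain host-side C++ rather than kernel code, and the function names sum_squares_pow and sum_squares_mul are hypothetical. Whether the substitution pays off on a given accelerator is an assumption to confirm with a profiler.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical baseline: calls the general-purpose std::pow for every element.
double sum_squares_pow(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += std::pow(v[i], 2.0);
    }
    return sum;
}

// Hypothetical cheaper variant: a single multiply replaces the pow call.
double sum_squares_mul(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

Both functions compute the same sum of squares; only the cost of the per-element operation differs.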

SYCL-specific Optimizations

  • When possible, specify a work-group size. The attribute [[cl::reqd_work_group_size(X, Y, Z)]], where X, Y, and Z are integer dimensions in the ND-range, can be used to set the work-group size. The compiler can take advantage of this information to optimize more aggressively.
  • Consider use of the
    option when possible. This option relaxes the order of arithmetic floating-point operations.
  • Consider use of the
    option when possible. This option removes intermediary floating-point rounding operations and conversions whenever possible and carries additional bits to maintain precision.
  • Consider use of the
    option. This option ignores dependencies between accessor arguments in a SYCL* kernel.
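
Relaxing the order of floating-point operations can change numerical results, because floating-point addition is not associative. The sketch below is plain C++, not SYCL kernel code, and assumes IEEE 754 single-precision evaluation without fast-math style compiler options. The function names left_grouped and right_grouped are hypothetical. It shows the same three values summed with two different groupings.

```cpp
// Evaluates (a + b) + c, the grouping written in the source.
float left_grouped(float a, float b, float c) {
    float ab = a + b;   // a small a can be absorbed by a large b here
    return ab + c;
}

// Evaluates a + (b + c), a grouping a reordering compiler might choose.
float right_grouped(float a, float b, float c) {
    float bc = b + c;   // large values of opposite sign cancel first
    return a + bc;
}
```

With a = 1.0f, b = 1e8f, and c = -1e8f, the first grouping loses the small term when it is added to 1e8f, while the second grouping preserves it. Code whose correctness depends on one particular grouping is therefore a poor candidate for relaxed floating-point ordering.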
