Optimize

Intel® oneAPI Programming Guide

ID 771723

Date 12/17/2022

Public


oneAPI enables functional code that can execute on multiple accelerators; however, that code may not perform optimally on every accelerator. A three-step optimization strategy is recommended to meet performance needs:

1. Pursue general optimizations that apply across accelerators.
2. Optimize aggressively for the prioritized accelerators.
3. Optimize the host code in conjunction with steps 1 and 2.

Optimization is the process of eliminating bottlenecks, that is, the sections of code that take more execution time than other sections. These sections may execute on the devices or on the host. During optimization, use a profiling tool such as Intel® VTune™ Profiler to find these bottlenecks.
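As a rough illustration of what "finding the bottleneck" means, the sketch below times code sections by hand and reports which one takes the largest share of the total. It is not a substitute for a profiler such as Intel® VTune™ Profiler. The names `time_section_ms`, `SectionTime`, `Bottleneck`, and `find_bottleneck` are hypothetical helpers introduced for this example only.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical helper: run `work` once and return the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_section_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// One measured section of the program (host or device side).
struct SectionTime {
    std::string name;
    double ms;
};

// The slowest section and its fraction of the total measured time.
struct Bottleneck {
    std::string name;
    double share;
};

// Return the section with the largest measured time.
// An empty input (or a zero total) yields an empty name and a share of 0.
Bottleneck find_bottleneck(const std::vector<SectionTime>& sections) {
    double total = 0.0;
    const SectionTime* worst = nullptr;
    for (const auto& s : sections) {
        total += s.ms;
        if (worst == nullptr || s.ms > worst->ms) {
            worst = &s;
        }
    }
    if (worst == nullptr || total <= 0.0) {
        return {"", 0.0};
    }
    return {worst->name, worst->ms / total};
}
```

A typical use would wrap each candidate section in `time_section_ms`, collect the results in a `std::vector<SectionTime>`, and pass that to `find_bottleneck`. Single wall-clock measurements are noisy, so repeated runs are advisable before acting on the result.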

This section discusses the first step of the strategy: pursuing general optimizations that apply across accelerators. Device-specific optimizations and best practices (step 2) and optimizations between the host and devices (step 3) are detailed in device-specific optimization guides, such as the FPGA Optimization Guide for Intel® oneAPI Toolkits. This section assumes that the kernel to offload to the accelerator has already been chosen and that the work will run on a single accelerator. It does not address the division of work between the host and an accelerator, or between the host and multiple, possibly different, accelerators.

General optimizations that apply across accelerators can be classified into four categories:

High-level optimizations
Loop-related optimizations
Memory-related optimizations
SYCL-specific optimizations

The following sections only summarize these optimizations; details on how to code most of them can be found online or in commonly available code optimization literature. More detail is provided for the SYCL-specific optimizations.

High-level Optimization Tips

Increase the amount of parallel work. Provide more work than there are processing elements so that the processing elements stay fully utilized.
Minimize the code size of kernels. Smaller kernels are more likely to stay in the accelerator's instruction cache, if the accelerator has one.
Load-balance kernels. Avoid significantly different execution times between kernels, because long-running kernels may become bottlenecks and reduce the throughput of the other kernels.
Avoid expensive functions. Avoid calling functions with high execution times, because they may become bottlenecks.
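The "avoid expensive functions" tip can be illustrated with polynomial evaluation. The sketch below compares a version that calls `std::pow` once per term with a version that uses Horner's rule, which needs only one multiply and one add per coefficient. The function names `poly_pow` and `poly_horner` are hypothetical. Whether `std::pow` is actually a bottleneck depends on the device and compiler, so measure before and after.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Baseline: evaluates sum(coeffs[i] * x^i) with one std::pow call per term.
double poly_pow(const std::vector<double>& coeffs, double x) {
    double sum = 0.0;
    for (std::size_t i = 0; i < coeffs.size(); ++i) {
        sum += coeffs[i] * std::pow(x, static_cast<double>(i));
    }
    return sum;
}

// Horner's rule: evaluates the same polynomial from the highest-order
// coefficient down, using one multiply and one add per coefficient
// and no library calls.
double poly_horner(const std::vector<double>& coeffs, double x) {
    double sum = 0.0;
    for (std::size_t i = coeffs.size(); i-- > 0;) {
        sum = sum * x + coeffs[i];
    }
    return sum;
}
```

For the coefficients `{5, 4, 3, 2}`, which represent 2x^3 + 3x^2 + 4x + 5, both functions return 41 at x = 2. For general inputs the two results can differ in the last bits, because the operations are rounded in a different order.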

SYCL-specific Optimizations

When possible, specify a work-group size. The attribute [[cl::reqd_work_group_size(X, Y, Z)]], where X, Y, and Z are the integer sizes of the work-group in each dimension of the ND-range, sets the work-group size. (SYCL 2020 spells this attribute [[sycl::reqd_work_group_size(X, Y, Z)]]; the cl:: form is the older spelling.) The compiler can use this information to optimize more aggressively.
Consider using the -Xsfp-relaxed option when possible. This option relaxes the order of floating-point arithmetic operations, so results may differ slightly from those of strict evaluation order.
Consider using the -Xsfpc option when possible. This option removes intermediate floating-point rounding operations and conversions whenever possible and carries additional bits to maintain precision.
Consider using the -Xsno-accessor-aliasing option. This option tells the compiler to ignore dependencies between accessor arguments in a SYCL* kernel; use it only when the accessors do not refer to overlapping memory.
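Two consequences of the tips above can be sketched in plain C++, without a SYCL toolchain. First, a SYCL ND-range requires the global size to be evenly divisible by the work-group size in each dimension, so a kernel with a required work-group size is usually launched over a padded range and guards against the extra work-items. Second, floating-point addition is not associative, which is why relaxing the operation order can change results. The helper names `padded_global_size`, `sum_left`, and `sum_right` are hypothetical and are not part of any SYCL or Intel API.

```cpp
#include <cstddef>

// Hypothetical helper: round n up to the next multiple of wg_size so the
// result can serve as the global size of an ND-range whose local size is
// wg_size. Precondition: wg_size > 0 (a zero wg_size returns n unchanged).
// The kernel must still skip work-items whose index is >= n.
std::size_t padded_global_size(std::size_t n, std::size_t wg_size) {
    if (wg_size == 0) {
        return n;
    }
    return ((n + wg_size - 1) / wg_size) * wg_size;
}

// Strict left-to-right order: (a + b) + c.
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

// Reassociated order: a + (b + c). A compiler that is allowed to relax
// the order of floating-point operations may pick this form instead.
float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

For example, `padded_global_size(1000, 64)` yields 1024. With `a = 1e20f`, `b = -1e20f`, and `c = 1.0f`, `sum_left` returns 1 while `sum_right` returns 0, because the 1 is absorbed when it is added to -1e20 first. This assumes the code is built without fast-math style options, which could reorder the additions in both functions.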
