Designing high-performance software requires you to “think
differently” than you might normally do when writing software. You
need to be aware of the hardware on which your code is intended to
run, and the characteristics which control the performance of that
hardware. Your goal is to structure the code such that it produces
correct answers, but does so in a way that maximizes the hardware’s
ability to execute the code.
oneAPI is a cross-industry, open, standards-based unified programming
model that delivers a common developer experience across accelerator
architectures. A unique feature of accelerators is that they are
additive to the main CPU on the platform. The primary benefit of
using an accelerator is to improve the behavior of your software by
partitioning it across the host and accelerator to specialize portions
of the computation that run best on the accelerator. Accelerator
architectures can offer a benefit through specialization of compute
hardware for certain classes of computations. This enables them to
deliver the best results for software specialized to the accelerator.
The primary focus of this document is GPUs. Each section focuses on
different topics to guide you in your path to creating optimized
solutions. The Intel
oneAPI toolkits provide the languages and
development tools you will use to optimize your code. This includes
compilers, debuggers, profilers, analyzers, and libraries.
Productive Performance not Performance Portability
While this document focuses on GPUs, you may also need your
application to run on CPUs and other types of accelerators. Since
accelerator architectures are specialized, you need to specialize your
code to achieve best performance. Specialization includes
restructuring and tuning the code to create the best mapping of the
application to the hardware. In extreme cases, this may require
redesigning your algorithms for each accelerator to best expose the
right type of computation. The value of oneAPI is that it allows each
of these variations to be expressed in a common language with
device-specific variants launched on the appropriate accelerator.
Phases in the Optimization Workflow
The first phase in using a GPU is to identify which parts of the
application can benefit. This is usually compute-intensive code that
has the right ratio of memory accesses to computation, and has the
right data dependence patterns to map onto the GPU. GPUs typically
include local memory and provide massive parallelism; these two
characteristics determine which properties of the code are most
important when deciding what to offload.
The Advisor tool included in the Intel oneAPI toolkits is designed to
analyze your code and help you identify the
best opportunities for parallel execution. The profilers in Advisor
measure the data movement in your functions, the memory access
patterns, and the amount of computation in order to project how code
will perform when mapped onto different accelerators. The regions
with the highest potential benefit should be your first targets for
offload.
GPUs often exploit parallelism at multiple levels. This includes
overlap between host and GPU, parallelism across the compute
cores, overlap between compute and memory accesses, concurrent
pipelines, and vector computation. Using all these levels of
parallelism requires a good understanding of the GPU architecture and
capabilities in the libraries and languages at your disposal.
Keep all the compute resources busy.
There must be enough
independent tasks to saturate the device and fully utilize all
execution resources. For example, if the device has 100 compute cores
but you only have one task, 99% of the device will be idle. Often you
create many more independent tasks than available compute resources so
that the hardware can schedule more work as prior tasks complete.
Minimize the synchronization between the host and the device. The
host launches a kernel on the device and waits for its
completion. Launching a kernel incurs overhead, so structure the
computation to minimize the number of times a kernel is launched.
Minimize the data transfer between host and device. Data
starts on the host and is copied to the device as input to the
computation. When a computation is finished, the results must be
transferred back to the host. For best performance, minimize data
transfer by keeping intermediate results on the device between
computations. Reduce the impact of data transfer by overlapping
computation and data movement so the compute cores never have to wait
for the data.
Keep the data in faster memory and use an appropriate access
pattern. GPU architectures have different types of memory
which have different access costs. Registers, caches, and scratchpads
are cheaper to access than global memory, but have smaller
capacity. When data is loaded into a register, cache line, or memory
page, use an access pattern that will use all the data before moving
to the next chunk. When memory is banked, use a stride that avoids all
the compute cores trying to access the same memory bank.
Profiling and Tuning your Code
After you have designed your code for high performance, the next step
is to measure how it runs on the target accelerator. Add timers to the
code, collect traces, and use tools like Intel VTune Profiler to
observe the program as it runs. The information collected can identify
where hardware is bottlenecked and idle, illustrate how behavior
compares with peak hardware roofline, and identify the most important
hotspots to focus optimization effort.