Introduction
Designing high-performance software requires you to think
differently than you might normally do when writing software. You
need to be aware of the hardware on which your code is intended to
run, and the characteristics that control the performance of that
hardware. Your goal is to structure the code such that it produces
correct answers, but does so in a way that maximizes the hardware’s
ability to execute the code.
oneAPI is a cross-industry, open, standards-based, unified programming
model that delivers a common developer experience across accelerator
architectures. A unique feature of accelerators is that they are
additive to the main CPU on the platform. The primary benefit of
using an accelerator is improved performance: you partition your
software across the host and the accelerator so that the portions of
the computation that run best on the accelerator execute there.
Accelerator architectures gain this advantage by specializing their
compute hardware for certain classes of computations, which lets them
deliver the best results for software that is, in turn, specialized to
the accelerator architecture.
The primary focus of this document is GPUs. Each section focuses on
different topics to guide you in your path to creating optimized
solutions. The Intel® oneAPI toolkits provide the languages and
development tools you will use to optimize your code. This includes
compilers, debuggers, profilers, analyzers, and libraries.
Productive Performance, Not Performance Portability
While this document focuses on GPUs, you may also need your
application to run on CPUs and other types of accelerators. Since
accelerator architectures are specialized, you need to specialize your
code to achieve best performance. Specialization includes
restructuring and tuning the code to create the best mapping of the
application to the hardware. In extreme cases, this may require
redesigning your algorithms for each accelerator to best expose the
right type of computation. The value of oneAPI is that it allows each
of these variations to be expressed in a common language with
device-specific variants launched on the appropriate accelerator.
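As a minimal illustration of this single-source style (a sketch, not
code taken from the toolkits; the array size and the doubling kernel
are arbitrary placeholders), the same SYCL kernel below runs on a GPU
when one is present and otherwise falls back to the CPU:

    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      std::vector<float> data(1024, 1.0f);

      // Prefer a GPU if one is present; otherwise run the identical
      // kernel on the CPU. Only the device selection changes.
      sycl::queue q;
      try {
        q = sycl::queue{sycl::gpu_selector_v};
      } catch (const sycl::exception &) {
        q = sycl::queue{sycl::cpu_selector_v};
      }
      std::cout << "Running on: "
                << q.get_device().get_info<sycl::info::device::name>()
                << "\n";

      {
        sycl::buffer<float> buf{data.data(), sycl::range<1>{data.size()}};
        q.submit([&](sycl::handler &h) {
          sycl::accessor a{buf, h, sycl::read_write};
          h.parallel_for(sycl::range<1>{data.size()},
                         [=](sycl::id<1> i) { a[i] *= 2.0f; });
        });
      } // the buffer destructor waits and copies results back to data
      return 0;
    }

Device-specific tuning, such as work-group sizes, memory layout, or
algorithm variants, can then be selected at run time based on the
device the queue resolves to.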
Phases in the Optimization Workflow
The first phase in using a GPU is to identify which parts of the
application can benefit. This is usually compute-intensive code that
has the right ratio of memory accesses to computation, and has the
right data dependence patterns to map onto the GPU. GPUs have their
own local memory and typically provide massive parallelism, and these
properties determine which characteristics of the code are most
important when deciding what to offload.
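As a rough, hypothetical illustration of that ratio (the arithmetic
intensity), compare the two loops below; the function and array names
are invented for this sketch. The vector add performs about one
floating-point operation per 12 bytes moved and is limited by memory
bandwidth, while the polynomial evaluation performs many independent
operations per element and is a far better offload candidate:

    #include <cstddef>

    // Low arithmetic intensity: roughly 1 FLOP per 12 bytes moved
    // (two 4-byte loads and one store), so memory bandwidth, not
    // compute, limits performance.
    void vector_add(const float *a, const float *b, float *c,
                    std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    }

    // Higher arithmetic intensity: 64 multiply-adds per element
    // loaded and stored, and every iteration is independent, so the
    // loop maps well onto massively parallel hardware.
    void poly_eval(const float *a, const float *coeff, float *c,
                   std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) {
        float x = a[i], acc = 0.0f;
        for (int k = 0; k < 64; ++k)
          acc = acc * x + coeff[k];
        c[i] = acc;
      }
    }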
The Intel Advisor tool included in the Intel oneAPI Base Toolkit is designed to
analyze your code and help you identify the best opportunities for parallel
execution. The profilers in Intel Advisor measure the data movement in your
functions, the memory access patterns, and the amount of computation in order
to project how code will perform when mapped onto different accelerators. The
regions with highest potential benefit should be your first targets for
acceleration.
GPUs often exploit parallelism at multiple levels. This includes
overlap between host and GPU, parallelism across the compute
cores, overlap between compute and memory accesses, concurrent
pipelines, and vector computation. Using all of these levels of
parallelism requires a good understanding of the GPU architecture and
of the capabilities of the libraries and languages at your disposal.
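For example, overlap between the host and the GPU falls out of the
fact that kernel submission is asynchronous. In the sketch below (a
hypothetical example; the size and kernel body are placeholders), the
host keeps working while the GPU computes and synchronizes only when
the device result is needed:

    #include <sycl/sycl.hpp>
    #include <cstddef>
    #include <iostream>

    int main() {
      constexpr std::size_t n = 1 << 20;
      sycl::queue q;
      float *dev = sycl::malloc_device<float>(n, q);

      // Launch work on the device; the call returns immediately.
      sycl::event e = q.parallel_for(
          sycl::range<1>{n},
          [=](sycl::id<1> i) { dev[i] = float(i[0]) * 0.5f; });

      // Meanwhile, the host performs independent work in parallel.
      double host_sum = 0.0;
      for (std::size_t i = 0; i < n; ++i)
        host_sum += 1.0 / double(i + 1);

      e.wait(); // synchronize only when the device result is needed
      std::cout << "host result: " << host_sum << "\n";

      sycl::free(dev, q);
      return 0;
    }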
Keep all the compute resources busy.
There must be enough
independent tasks to saturate the device and fully utilize all
execution resources. For example, if the device has 100 compute cores
but you only have one task, 99% of the device will be idle. Often you
create many more independent tasks than available compute resources so
that the hardware can schedule more work as prior tasks complete.
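A minimal sketch of this idea (assuming q is a sycl::queue and a, b,
c are device-accessible float arrays of length n; none of these names
come from the guide): expose one work-item per element instead of a
single task, so the runtime has far more independent work than
execution resources:

    #include <sycl/sycl.hpp>
    #include <cstddef>

    void scale_and_add(sycl::queue &q, const float *a, const float *b,
                       float *c, std::size_t n) {
      // Poor utilization: a single task leaves almost every execution
      // unit idle.
      //   q.single_task([=]() {
      //     for (std::size_t i = 0; i < n; ++i)
      //       c[i] = a[i] + 2.0f * b[i];
      //   });

      // Better: one independent work-item per element lets the
      // scheduler keep all compute cores busy and hide memory latency.
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { c[i] = a[i] + 2.0f * b[i]; });
    }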
Minimize the synchronization between the host and the device.
The
host launches a kernel on the device and waits for its
completion. Launching a kernel incurs overhead, so structure the
computation to minimize the number of times a kernel is launched.
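One way to do this in SYCL is to express the dependency between
kernels to the runtime rather than blocking the host after each
launch. A sketch, assuming data is a device-accessible float array of
length n (the names and kernels are placeholders):

    #include <sycl/sycl.hpp>
    #include <cstddef>

    void two_step(sycl::queue &q, float *data, std::size_t n) {
      // Step 1: scale in place.
      sycl::event step1 = q.parallel_for(
          sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= 2.0f; });

      // Step 2 depends on step 1, but the dependency is handed to the
      // runtime instead of stalling the host with step1.wait().
      sycl::event step2 = q.submit([&](sycl::handler &h) {
        h.depends_on(step1);
        h.parallel_for(sycl::range<1>{n},
                       [=](sycl::id<1> i) { data[i] += 1.0f; });
      });

      step2.wait(); // a single host/device synchronization point
    }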
Minimize the data transfer between host and device.
Data typically
starts on the host and is copied to the device as input to the
computation. When a computation is finished, the results must be
transferred back to the host. For best performance, minimize data
transfer by keeping intermediate results on the device between
computations. Reduce the impact of data transfer by overlapping
computation and data movement so the compute cores never have to wait
for data.
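The sketch below illustrates the pattern with unified shared memory:
the input is copied to the device once, the intermediate array never
leaves device memory, and only the final result comes back to the
host. It assumes q is an in-order queue (created with
sycl::property::queue::in_order) so operations execute in submission
order; the names and kernels are placeholders:

    #include <sycl/sycl.hpp>
    #include <cstddef>
    #include <vector>

    std::vector<float> pipeline(sycl::queue &q,   // in-order queue
                                const std::vector<float> &input) {
      const std::size_t n = input.size();
      float *in  = sycl::malloc_device<float>(n, q);
      float *tmp = sycl::malloc_device<float>(n, q); // stays on the device
      float *out = sycl::malloc_device<float>(n, q);

      q.memcpy(in, input.data(), n * sizeof(float)); // one copy to the device
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { tmp[i] = in[i] * in[i]; });
      q.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { out[i] = tmp[i] + 1.0f; });

      std::vector<float> result(n);
      q.memcpy(result.data(), out, n * sizeof(float)).wait(); // one copy back

      sycl::free(in, q);
      sycl::free(tmp, q);
      sycl::free(out, q);
      return result;
    }

SYCL buffers and accessors achieve the same effect implicitly: the
runtime keeps data resident on the device for as long as only device
kernels access it.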
Keep the data in faster memory and use an appropriate access pattern.
GPU
architectures have different types of memory and these have different access
costs. Registers, caches, and scratchpads are cheaper to access than local
memory, but have smaller capacity. When data is loaded into a register, cache
line, or memory page, use an access pattern that will use all the data before
moving to the next chunk. When memory is banked, use a stride that avoids all
the compute cores trying to access the same memory bank simultaneously.
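A small sketch of the access-pattern point, using an invented
row-major matrix kernel: neighboring work-items should touch
neighboring addresses so that each cache line or memory transaction is
fully used before the next one is fetched:

    #include <sycl/sycl.hpp>
    #include <cstddef>

    // m points to a row-major n x n matrix in device memory.
    void scale_matrix(sycl::queue &q, float *m, std::size_t n) {
      q.parallel_for(sycl::range<2>{n, n}, [=](sycl::id<2> idx) {
        std::size_t row = idx[0], col = idx[1];
        // Unit stride: idx[1] varies fastest, so consecutive
        // work-items read and write consecutive addresses in a row.
        m[row * n + col] *= 2.0f;

        // By contrast, indexing as m[col * n + row] would make
        // consecutive work-items stride by n floats, wasting most of
        // each cache line and piling accesses onto fewer memory banks.
      });
    }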
Profiling and Tuning Your Code
After you have designed your code for high performance, the next step is to
measure how it runs on the target accelerator. Add timers to the code, collect
traces, and use tools such as Intel VTune Profiler to observe the
program as it runs. The information collected can identify where the
hardware is bottlenecked or idle, show how the observed behavior
compares with the hardware’s roofline (its peak achievable
performance), and pinpoint the most important hotspots on which to
focus your optimization efforts.
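A simple sketch of such instrumentation (the kernel and problem size
are placeholders) combines a host-side std::chrono timer with SYCL
event profiling; the queue must be created with the enable_profiling
property for the event timestamps to be available:

    #include <sycl/sycl.hpp>
    #include <chrono>
    #include <cstddef>
    #include <iostream>

    int main() {
      constexpr std::size_t n = 1 << 24;
      sycl::queue q{sycl::default_selector_v,
                    sycl::property::queue::enable_profiling{}};
      float *data = sycl::malloc_device<float>(n, q);

      auto t0 = std::chrono::steady_clock::now();
      sycl::event e = q.parallel_for(
          sycl::range<1>{n},
          [=](sycl::id<1> i) { data[i] = sycl::sqrt(float(i[0])); });
      e.wait();
      auto t1 = std::chrono::steady_clock::now();

      // Device-side timestamps, in nanoseconds, for the kernel alone.
      auto start = e.get_profiling_info<
          sycl::info::event_profiling::command_start>();
      auto end = e.get_profiling_info<
          sycl::info::event_profiling::command_end>();

      std::cout << "host wall time:     "
                << std::chrono::duration<double, std::milli>(t1 - t0).count()
                << " ms\n"
                << "device kernel time: " << (end - start) * 1e-6 << " ms\n";

      sycl::free(data, q);
      return 0;
    }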