Performance Optimization Scenarios using GPU Roofline
The Roofline chart does not directly indicate what you need to change in
your code to make it run faster on a GPU (although it provides some code
hints and guidance), but it shows the memory locality patterns that
dominate in your algorithm implementation. By examining where the kernel
dots are located on the chart in relation to memory levels, you can
identify the memory stage that is too narrow for the data flow and is
the bottleneck. With this information, you can modify the data access
pattern used in your algorithm, for example, by applying data blocking
to reuse cached data and avoid unnecessary repeated reads.
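As a minimal sketch of this reasoning, the following Python snippet computes the attainable performance under each memory-level roof and picks the lowest one as the bottleneck. The function name, the level names used as dictionary keys, and all numbers are assumptions chosen for illustration; they are not Intel Advisor output or API.

```python
# Illustrative sketch; all numbers below are assumed, not Intel Advisor output.

def roofline_limits(ops, traffic_bytes, bandwidth, peak_ops_per_s):
    """Return the attainable OP/S under each memory-level roof."""
    limits = {}
    for level, nbytes in traffic_bytes.items():
        intensity = ops / nbytes  # operations per byte at this level
        limits[level] = min(peak_ops_per_s, intensity * bandwidth[level])
    return limits


ops = 80e9                              # operations executed by the kernel
traffic = {"L3": 40e9, "GTI": 10e9}     # bytes moved through each level
bandwidth = {"L3": 200e9, "GTI": 60e9}  # peak bytes per second per level
limits = roofline_limits(ops, traffic, bandwidth, peak_ops_per_s=1000e9)

# The level with the lowest roof is the stage that is too narrow for the data flow.
bottleneck = min(limits, key=limits.get)
print(bottleneck, limits[bottleneck])
```

Because each memory level sees different traffic, the same kernel has a different arithmetic intensity per level, which is why the chart shows one dot per level.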
The more experience you have, the better you can understand data
patterns, but there are basic cases that we can examine. Real-life
applications, however, do not usually show extreme behavior, such as
being purely bound to a certain roof, because they are affected by:
Auxiliary data transferred between memory and EUs, such as indexes of
work-group item calculations, memory addresses arithmetic, loop
Data caching being more complicated as it is affected by the
EU thread scheduling affecting data management
Let us consider several real-life examples of different applications and
their implementations, which are similar to the theoretical cases discussed earlier.
Data Block Optimized for the Matrix Multiply Algorithm with no SLM Usage
This implementation is a naïve matrix multiply without data blocking
and is similar to the optimized kernel and data flow optimization case.
Even though data is not organized in blocks in the source code, the
compiler recognizes the pattern and optimizes access to matrix arrays.
As a result, we have a high level of data reuse in cache, and kernel
performance is limited by the L3 cache. The Roofline chart shows one dot
corresponding to a single kernel in the application. Based on its
location on the chart:
The kernel is memory bound, and the corresponding dot is close to L3
GTI data traffic is four times smaller than the L3 cache data
traffic, which indicates high data reuse.
CARM and L3 traffic are almost the same, which indirectly indicates
100% of cache line usage because cache lines are fully filled with
To confirm the 100% cache line usage, review the L3 Cache Line
Utilization metric in the
pane grid, which is 100%. The grid
also reports EU Threading Occupancy of 98.1%, which indicates good
scheduling and data distribution among threads.
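The traffic ratios used in this reading of the chart can be written down as simple arithmetic. The sketch below is a rough heuristic under stated assumptions: the function names are hypothetical, the traffic values are invented to mirror the ratios described above, and the utilization estimate is only an indirect indicator, not the L3 Cache Line Utilization metric itself.

```python
def reuse_factor(inner_traffic, outer_traffic):
    """How many times more data the inner level moves than the outer one."""
    return inner_traffic / outer_traffic


def line_utilization_estimate(carm_traffic, l3_traffic):
    """Rough share of each transferred cache line that the EUs request."""
    return min(1.0, carm_traffic / l3_traffic)


# Assumed traffic in gigabytes, chosen to mirror the ratios described above.
carm_gb, l3_gb, gti_gb = 39.6, 40.0, 10.0
l3_reuse = reuse_factor(l3_gb, gti_gb)                   # data reuse in L3
utilization = line_utilization_estimate(carm_gb, l3_gb)  # near 1.0 means full lines
print(l3_reuse, utilization)
```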
To understand the limitations for future kernel code optimization,
review the following data reported by the Intel Advisor:
The Roofline Guidance pane shows kernel limitation summary and
provides estimation for possible performance speedup after
optimization. For the matrix multiply kernel, the main limitation
is the L3 cache bandwidth. The kernel can run 1.4x faster if it
uses the L3 cache more effectively and reaches its maximum
bandwidth with the same arithmetic intensity, but a better data
The Memory Metrics pane can help you understand memory level
impact, which is time spent in requests from different memory
levels, as a percentage of the total time. For this kernel, GTI has
less impact than the L3 cache, but it still takes a big part of
total kernel execution time and may become a bottleneck after the
L3 bandwidth limits are eliminated, for example, using SLM.
The Shares metric is a visual way of estimating data portions processed
from different memory levels. In this case, L3 cache has 4x more
data than GTI.
The OP/S and Bandwidth pane shows the number of measured operations
per second and data traffic in relation to the bandwidth limitations.
For this kernel, the summary reports the following data:
The theoretical SLM bandwidth is almost 3x higher than the
L3 cache bandwidth, but the SLM is not used in this
implementation. Blocking matrix arrays to use them as local shared
data can eliminate the L3 cache bandwidth limits.
The kernel performance is only 27% of the theoretically possible peak
performance for Int32 data. With a better memory access
implementation, we could potentially reach a 3x performance increase
for this kernel.
Data Block Optimized Matrix Multiply Algorithm with SLM
Following the recommendations from the previous Intel Advisor
result, we split the matrix arrays into small blocks to implement matrix
multiplication data blocking and put the data blocks into the SLM for
faster data reuse on a sub-slice level.
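To illustrate the idea of data blocking, here is a minimal Python sketch of a tiled matrix multiply. It is not the benchmark kernel: the function names are hypothetical, the per-tile copies merely stand in for SLM, and the read counter is a simplified model of traffic to the matrix arrays.

```python
def naive_matmul(a, b, n):
    """Reference n x n multiply; each of the n*n outputs reads 2*n input elements."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def blocked_matmul(a, b, n, tile):
    """Multiply tile by tile; returns (c, global_reads) for inputs a and b."""
    assert n % tile == 0, "this sketch assumes that tile divides n"
    c = [[0] * n for _ in range(n)]
    global_reads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of each input in a local buffer (SLM stand-in).
                a_tile = [a[i0 + i][k0:k0 + tile] for i in range(tile)]
                b_tile = [b[k0 + k][j0:j0 + tile] for k in range(tile)]
                global_reads += 2 * tile * tile
                # Each staged element is reused `tile` times from the local copy.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, global_reads


n, tile = 8, 4
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
c, reads = blocked_matmul(a, b, n, tile)
naive_reads = 2 * n ** 3  # reads issued by naive_matmul without any caching
print(reads, naive_reads)
```

In this model, blocking reduces reads from the matrix arrays by a factor equal to the tile size, because the remaining accesses are served from the local copies.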
For this optimized implementation with data blocking, the Roofline chart
looks as follows:
The data distribution has changed from the previous result. Most notably,
the execution is no longer memory bound but compute bound, which is good
for overall performance and further optimizations.
There are a couple of things to note in the memory-level dots:
SLM traffic is much bigger than L3 traffic. L3 traffic is not zero,
which is expected as data blocks are read to L3 cache and then copied
to SLM for reuse.
CARM data traffic is three times bigger than the SLM traffic. The
reason is not clear from the result, but it is a known effect that
happens due to EU data port buffering data brought from SLM and
accessed sequentially. This effect is positive and implies data reuse
on the memory level closest to EUs.
Let us review data in the GPU Detail pane to understand changes in
performance for this algorithm implementation:
As the OP/S and Bandwidth pane shows, the L3 and SLM bandwidth are
far from their limits. The kernel performance has increased to 47% of
its theoretical limit of integer operations per second (INTOPS).
As the Roofline Guidance chart shows, the kernel performance is
limited by the Int32 Vector operations, which are the operations that
the compiler used to implement the code. The chart indicates that the
kernel can be optimized to run 2.1x faster.
As the Performance Characteristics pane shows, the EUs are stalled
for 43.6% of execution cycles. As the algorithm is fully
vectorized, there should be other reasons for the EU stalls. By
optimizing the EU performance, you might get the 2.1x performance
improvement indicated in the Roofline Guidance pane.
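The reported numbers are consistent with simple headroom arithmetic: a kernel running at 47% of its roof could run about 1 / 0.47 times faster if it reached that roof. The sketch below shows this calculation; it is our reading of the figures, and it is only an assumption that Intel Advisor derives its estimate in a comparable way.

```python
def headroom(fraction_of_roof):
    """Speedup available if the kernel reached the roof it is measured against."""
    if not 0 < fraction_of_roof <= 1:
        raise ValueError("fraction_of_roof must be in (0, 1]")
    return 1.0 / fraction_of_roof


speedup = headroom(0.47)  # 47% of the theoretical INTOPS limit
print(round(speedup, 1))
```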
Big Data Array with Data Reuse for a STREAM Benchmark
The STREAM benchmark is a small application that brings a big chunk of data
from memory and executes basic compute kernels: Copy, Scale, Add, and
Triad. The number of compute operations per kernel is small or equal to
zero, so the kernels are expected to be memory bound. For this reason,
the benchmark is used to determine the data bandwidth limits of a system.
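The four kernels can be summarized by the number of array reads, array writes, and arithmetic operations they perform per element. The sketch below tabulates them and derives the arithmetic intensity; the 8-byte element size is an assumption (double precision), and the helper name is hypothetical.

```python
# Per-element cost of each STREAM kernel: (reads, writes, operations).
STREAM_KERNELS = {
    "Copy": (1, 1, 0),   # c[i] = a[i]
    "Scale": (1, 1, 1),  # b[i] = s * c[i]
    "Add": (2, 1, 1),    # c[i] = a[i] + b[i]
    "Triad": (2, 1, 2),  # a[i] = b[i] + s * c[i]
}


def arithmetic_intensity(kernel, bytes_per_element=8):
    """Operations per byte of array traffic (element size is an assumption)."""
    reads, writes, operations = STREAM_KERNELS[kernel]
    return operations / ((reads + writes) * bytes_per_element)


for name in STREAM_KERNELS:
    print(name, arithmetic_intensity(name))
```

All four intensities are well below one operation per byte, which is why the kernels land on the memory-bound side of the chart.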
After analyzing the benchmark with the GPU Roofline Insights on the
Intel Processor Graphics code-named Tiger Lake, the Roofline chart shows
four dots that correspond to the benchmark kernels. The dots are located
on the memory-bound side of the chart below the DRAM bandwidth roof.
The Roofline Guidance chart shows that the kernels are GTI Bandwidth
bound, not DRAM bound as the main Roofline chart suggests. The reason
is that Intel Advisor cannot measure the bandwidth for data
transferred between DRAM and EU on integrated GPUs due to hardware
limitations.
The Roofline Guidance suggests improving cache locality to optimize
performance and get better data reuse. This advice also applies to
other cases where you test data bandwidth and compute performance is not
the goal of optimization.
In the OP/S and Bandwidth pane, review the specific numbers for the
achieved memory bandwidth. Notice that CARM, L3, and GTI stages have
similar achieved bandwidth, so the bottleneck for this benchmark is the
most distant memory interface.
Note that the CARM, L3, and GTI stages have similar effective
bandwidth: all three memory components are roughly identical for a
given kernel. Identical roofline components mean that there is no data
reuse in the cache or register file, so every data fetch goes all the
way down to external memory because no data is ever cached. Equal CARM,
L3, and external memory roofline components are a common indication
of a “streaming” pattern.
In this case, it also indicates that the most distant memory interface
is the bottleneck for this benchmark. The slight difference in bandwidth
that can still be observed between kernels arises because the Copy and
Scale kernels have an equal number of reads and writes, while the Add
and Triad kernels have twice as many reads as writes, and read bandwidth
is higher on this system.
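The streaming indication described above can be expressed as a simple check: if the data traffic measured at every memory level is roughly the same, nothing is being reused between levels. The following sketch assumes a hypothetical 10% tolerance and invented traffic values; neither comes from Intel Advisor.

```python
def is_streaming(traffic_by_level, tolerance=0.1):
    """True if traffic at all memory levels agrees within the given tolerance."""
    values = list(traffic_by_level.values())
    return (max(values) - min(values)) / max(values) <= tolerance


# Assumed traffic in gigabytes for two kinds of kernels.
stream_like = {"CARM": 10.2, "L3": 10.0, "GTI": 9.8}
cache_friendly = {"CARM": 40.0, "L3": 40.0, "GTI": 10.0}
print(is_streaming(stream_like), is_streaming(cache_friendly))
```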
To eliminate the hardware limitations of the Intel Processor Graphics
code-named Tiger Lake that do not allow Intel Advisor to measure
bandwidth between DRAM and EU, let us analyze the benchmark running on
discrete Intel® Iris Xe MAX graphics. The resulting Roofline chart shows
four kernel dots below the DRAM bandwidth roof.
In the OP/S and Bandwidth pane, Intel Advisor now correctly identifies
DRAM as the highest level of bottleneck.
As the OP/S and Bandwidth and Memory Metrics panes show, the DRAM data
traffic is very close to its theoretical limit, and the STREAM benchmark
really measures the practical limits of the data flow.
Partially Effective Data Access in a Stencil Data Pattern Benchmark
One of the most interesting cases is when data access is compact but in
a very limited local range, while globally, the access is sparse. Such a
case is frequent in real-life applications, for example, in a
stencil-based kernel computation where data in two axes, for example, X
and Y, is accessed sequentially in the memory space, but data in the Z
axis is accessed with a large stride.
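A short sketch can make the stride visible. It assumes a row-major three-dimensional array, a seven-point stencil, 8-byte elements, and 64-byte cache lines; none of these details are taken from the benchmark discussed in this section, and the helper names are hypothetical.

```python
def linear_index(x, y, z, nx, ny):
    """Element offset of (x, y, z) in a row-major nx * ny * nz array."""
    return x + nx * (y + ny * z)


def cache_lines_touched(indices, bytes_per_element=8, line_bytes=64):
    """Count the distinct cache lines covered by the given element offsets."""
    return len({(i * bytes_per_element) // line_bytes for i in indices})


nx = ny = 64        # assumed grid extent in X and Y
x, y, z = 10, 8, 8  # an interior grid point
neighbors = [(0, 0, 0), (-1, 0, 0), (1, 0, 0),
             (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
offsets = [linear_index(x + dx, y + dy, z + dz, nx, ny)
           for dx, dy, dz in neighbors]

# One step along Z jumps a whole XY plane of elements.
z_stride = linear_index(x, y, z + 1, nx, ny) - linear_index(x, y, z, nx, ny)
lines = cache_lines_touched(offsets)
print(z_stride, lines)
```

For a single point taken in isolation, the seven elements are spread over several cache lines, so only a small part of each fetched line is used unless neighboring points reuse it.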
Let us analyze the 504.polbm application from the SPEC ACCEL benchmark
suite running on the Intel Processor Graphics with Gen12 architecture.
This benchmark application is written in C with OpenMP* offload to GPU.
It works with double-precision numbers, but the Intel Processor Graphics
with Gen12 architecture can only emulate the calculations with integer
data. That is why we examine the Roofline chart for integer operations.
The GPU Roofline chart shows one dot that corresponds to the benchmark
kernel. The dot is located between memory and compute roofs, which means
that if GPU parameters are changed (for example, if you run the analysis
on hardware with a higher memory bandwidth), the kernel might
move slightly from memory bound to compute bound.
As the Roofline Guidance pane shows, the kernel is limited by L3 cache
bandwidth. Intel Advisor also detects low cache line utilization for the
kernel, which is expected from a stencil-based kernel.
In general, to optimize data access in the stencil-based kernels, you
need to apply different techniques that change data layout to use SLM
for data locality and SIMD parallelism per data axis. However, you
cannot change data layout for benchmarks, and all optimizations are done
by the Graphics Compiler.