# Optimize Memory-bound Applications with GPU Roofline

## Memory Path in Intel® GPU Microarchitecture

The memory path terminates at the X^e Vector Engine (XVE) register file, where data is consumed by compute instructions.

*Algorithmic data*. For example, for a naive implementation of the matrix multiplication algorithm, the theoretical amount of data read for each matrix and used for calculations is:

M^3 * D

where:

- D is the size of a matrix element, in bytes
- M is the size of the square matrix
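As a quick check of the formula above, here is a minimal sketch (not part of the recipe) that evaluates M^3 * D for an assumed matrix size and element type:

```python
# Sketch: theoretical data read per input matrix for a naive M x M x M
# matrix multiply, using the formula M^3 * D from the text.
def algorithmic_data_bytes(m: int, elem_size: int) -> int:
    """Each of the M*M output elements reads M elements from an input
    matrix, so every input matrix is read M^3 times at element level."""
    return m ** 3 * elem_size

# Hypothetical example: 1024 x 1024 matrices of 4-byte Int32 elements.
print(algorithmic_data_bytes(1024, 4))  # 4294967296 bytes (4 GiB)
```

The matrix size and element type here are assumptions chosen only to make the arithmetic concrete.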

## GPU Roofline Methodology

*CPU/Memory Roofline Insights*, which can analyze the performance of CPU applications, and *GPU Roofline Insights*, which can analyze the performance of GPU applications. The general methodology of the Roofline model, focused on the CPU/Memory Roofline, is explained in the resources listed in the Roofline Resources for Intel® Advisor Users. We strongly recommend learning the Roofline basics before continuing. This recipe explores features of the GPU Roofline Insights perspective for analyzing performance on Intel GPUs.

**Roofline Result Overview**

- Open the analysis results and examine the reported GPU Roofline chart. It plots the application's achieved performance and arithmetic intensity against the machine's maximum achievable performance. For example, for the matrix multiply application, the GPU Roofline chart filtered by the GTI (memory) level has one red dot representing a GPU kernel.

- Whether there is room to speed up kernel performance on the current GPU
- What the kernel is bound by (compute, cache, or memory bandwidth) and what you can change in the algorithm implementation to go beyond those limits and increase performance

**Kernel Location Calculation**

AI = M^3 / (3 * M^2)

where:

- M is the size of a square matrix
- M^3 is the number of operations
- 3 * M^2 is the amount of read/write data
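The ratio above simplifies to M / 3, so arithmetic intensity grows linearly with matrix size. A minimal sketch of the calculation, with an assumed matrix size for illustration:

```python
# Sketch: arithmetic intensity of the naive matrix multiply from the text,
# AI = M^3 / (3 * M^2) = M / 3 operations per element transferred.
def arithmetic_intensity(m: int) -> float:
    ops = m ** 3        # number of operations for an M x M x M product
    data = 3 * m ** 2   # elements read/written: two inputs plus one output
    return ops / data

# Hypothetical example: a 1024 x 1024 matrix gives AI of about 341.3.
print(arithmetic_intensity(1024))
```

This is why larger matrices push the dot to the right on the Roofline chart: the same three arrays are traversed, but each transferred element supports more operations.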

Performance = M^3 / T

where:

- T is the time it takes for the operations to complete
- M^3 is the number of operations

Measured performance is I′ / T′, where:

- I′ is the measured number of executed compute instructions
- T′ is the measured time

Compare the measured data traffic with the hardware bandwidth BW_XX at each memory stage. Assuming the algorithm is memory bound, at some levels the data flow should be close to the hardware bandwidth, while at other levels it can be less limited. To identify the most probable bottleneck of the algorithm implementation, find the dot that is closest to its corresponding memory-level roof line. Note that data flows may have more than one bottleneck; in that case, the distances between the dots and their corresponding roof lines are similar.
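The "closest dot to its roof" test can be sketched as a small comparison of achieved versus peak bandwidth per memory level. The function name and the traffic numbers below are assumptions for illustration, not Advisor output:

```python
# Sketch: find the memory level whose measured traffic runs closest to its
# hardware bandwidth, i.e. the dot nearest to its roof line on the chart.
def closest_to_roof(levels: dict) -> str:
    """levels maps a memory level name to (achieved GB/s, peak GB/s);
    the smallest peak/achieved ratio marks the most probable bottleneck."""
    return min(levels, key=lambda name: levels[name][1] / levels[name][0])

# Hypothetical bandwidths for a memory-bound kernel:
dots = {"CARM": (180.0, 1000.0), "L3": (850.0, 1000.0), "GTI": (210.0, 500.0)}
print(closest_to_roof(dots))  # prints "L3"
```

With these assumed numbers, L3 runs at 85% of its roof while the other levels have more headroom, so L3 bandwidth would be the first suspect.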

## Performance Optimization Scenarios using GPU Roofline

- Auxiliary data transferred between memory and the VEs, such as work-group item indexes, memory address arithmetic, and loop indexes
- Data caching becoming more complicated, as it is affected by the auxiliary data
- VE thread scheduling affecting data management

**Data Block Optimized for the Matrix Multiply Algorithm with no SLM Usage**

- The kernel is memory bound, and the corresponding dot is close to the L3 Bandwidth roof.
- GTI data traffic is four times smaller than the L3 cache data traffic, which indicates high data reuse.
- CARM and L3 traffic are almost the same, which indirectly indicates 100% cache line utilization, because the cache lines are fully filled with algorithmic data.
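The reuse conclusion in the bullets above follows directly from traffic ratios. A minimal sketch, with hypothetical traffic volumes standing in for Advisor's measurements:

```python
# Sketch: inferring data reuse from per-level traffic ratios. An L3-to-GTI
# ratio of ~4 means each line fetched over GTI is used ~4 times from L3;
# CARM traffic equal to L3 traffic suggests fully utilized cache lines.
def reuse_factor(inner_traffic_gb: float, outer_traffic_gb: float) -> float:
    """How many times data fetched from the outer memory level is
    reused at the inner (closer to the VEs) level."""
    return inner_traffic_gb / outer_traffic_gb

l3_gb, gti_gb = 12.0, 3.0  # hypothetical traffic volumes, in GB
print(reuse_factor(l3_gb, gti_gb))  # prints 4.0
```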

**Check the occupancy reported in the GPU pane grid, which is 100%. The grid also reports VE Threading Occupancy of 98.1%, which indicates good scheduling and data distribution among threads.**

The Roofline Guidance pane shows a summary of the kernel limitations and provides an estimate of the possible performance speedup after optimization. For the matrix multiply kernel, the main limitation is the L3 cache bandwidth. The kernel can run 1.4x faster if it uses the L3 cache more effectively and reaches its maximum bandwidth with the same arithmetic intensity but a better data access pattern.
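The 1.4x figure is simply the headroom between achieved and peak bandwidth at fixed arithmetic intensity. A minimal sketch, with assumed bandwidth values chosen to reproduce the ratio (they are not Advisor output):

```python
# Sketch: for a bandwidth-bound kernel at fixed arithmetic intensity,
# performance scales linearly with delivered bandwidth, so the speedup
# estimate is peak bandwidth divided by achieved bandwidth.
def bandwidth_headroom_speedup(achieved_gbs: float, peak_gbs: float) -> float:
    return peak_gbs / achieved_gbs

# Hypothetical numbers: 714 GB/s achieved against a 1000 GB/s L3 roof.
print(round(bandwidth_headroom_speedup(714.0, 1000.0), 1))  # prints 1.4
```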

The Memory Metrics pane can help you understand memory level impact, which is the time spent in requests from different memory levels as a percentage of the total time. For this kernel, GTI has less impact than the L3 cache, but it still takes a large part of the total kernel execution time and may become a bottleneck after the L3 bandwidth limits are eliminated, for example, by using SLM.

The Shares metric is a visual way of estimating the data portions processed at different memory levels. In this case, the L3 cache has 4x more data than GTI.

- The OP/S and Bandwidth pane shows the number of measured operations per second and the data traffic relative to the bandwidth limits. For this kernel, the summary reports the following:
- The theoretical SLM bandwidth is almost 3x higher than the L3 cache bandwidth, but SLM is not used in this implementation. Blocking the matrix arrays to use them as local shared data can eliminate the L3 cache bandwidth limits.
- The kernel performance is only 27% of the theoretically possible peak performance for Int32 data. With a better memory access implementation, we could potentially reach a 3x performance increase for this kernel.

**Data Block Optimized Matrix Multiply Algorithm with SLM**

Examine the GPU Roofline chart filtered at the X^e-core level.

- SLM traffic is much bigger than L3 traffic. L3 traffic is not zero, which is expected as data blocks are read to L3 cache and then copied to SLM for reuse.
- CARM data traffic is three times bigger than the SLM traffic. The reason is not clear from the result, but this is a known effect caused by the VE data port buffering data that is brought from SLM and accessed sequentially. The effect is positive and implies data reuse at the memory level closest to the VEs.
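The blocking idea behind the SLM variant can be sketched in plain code: copy tiles of the input matrices into a small scratch buffer (standing in for shared local memory) and reuse each tile for a whole tile of the output. This is an illustrative stand-in for the recipe's GPU kernel, not its actual implementation:

```python
# Sketch: blocked matrix multiply. Each tile of A and B is copied once
# into small scratch lists (the "SLM" analog) and then reused `tile`
# times, which is what cuts L3 traffic in the SLM version of the kernel.
def blocked_matmul(a, b, m, tile):
    c = [[0] * m for _ in range(m)]
    for ii in range(0, m, tile):
        for jj in range(0, m, tile):
            for kk in range(0, m, tile):
                # "SLM" copies: loaded once per tile triple, reused below
                a_t = [row[kk:kk + tile] for row in a[ii:ii + tile]]
                b_t = [row[jj:jj + tile] for row in b[kk:kk + tile]]
                for i in range(tile):
                    for j in range(tile):
                        c[ii + i][jj + j] += sum(
                            a_t[i][k] * b_t[k][j] for k in range(tile))
    return c

# Tiny check: 4x4 all-ones times all-twos gives all elements equal to 8.
a = [[1] * 4 for _ in range(4)]
b = [[2] * 4 for _ in range(4)]
print(blocked_matmul(a, b, 4, 2))
```

In a real SYCL kernel the scratch lists correspond to work-group local memory, and the tile size is chosen to fit the SLM capacity per X^e-core.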

- As the OP/S and Bandwidth pane shows, the L3 and SLM bandwidth are far from their limits. The kernel performance has increased to 47% of its theoretical limit of integer operations per second (INTOPS).
- As the Roofline Guidance chart shows, the kernel performance is limited by the Int32 Vector operations, which are the operations that the compiler used to implement the code. The chart indicates that the kernel can be optimized to run 2.1x faster.

As the Performance Characteristics pane shows, the VEs are stalled for 43.6% of execution cycles. As the algorithm is fully vectorized, there should be other reasons for the VE stalls. By optimizing the VE performance, you might get the 2.1x performance improvement indicated in the Roofline Guidance pane.

You can run the GPU Compute/Media Hotspots analysis of Intel® VTune™ Profiler to investigate the reasons for the VE stalls further.

**Big Data Array with Data Reuse for a STREAM Benchmark**

**Partially Effective Data Access in a Stencil Data Pattern Benchmark**