Examine Bottlenecks on GPU Roofline Chart
Explore Performance-Limiting Factors
- Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
- Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
- L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
- SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
- HBM roof: Represents the maximum bandwidth of HBM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in discrete GPU caches.
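The relationship between the two kinds of roofs can be sketched in a few lines of Python. This is a minimal illustration of the general roofline model, not Intel® Advisor code: the function name `roof_heights` and all hardware numbers below are hypothetical. At a given arithmetic intensity (FLOP/byte), a diagonal roof sits at intensity times bandwidth, and the horizontal compute roof caps it.

```python
def roof_heights(arithmetic_intensity, compute_peak_gflops, bandwidths_gbs):
    """Return the attainable GFLOPS under each memory roof at a given intensity.

    A diagonal (memory) roof has height intensity * bandwidth; the horizontal
    (compute) roof caps that value at the compute peak.
    """
    return {
        name: min(compute_peak_gflops, arithmetic_intensity * bandwidth)
        for name, bandwidth in bandwidths_gbs.items()
    }


# Hypothetical hardware numbers, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTHS_GBS = {"L3": 200.0, "SLM": 400.0, "GTI": 50.0, "DRAM": 30.0}

# A kernel at 2 FLOP/byte is far below every ridge point here,
# so each memory roof (not the compute roof) sets the ceiling.
heights = roof_heights(2.0, PEAK_GFLOPS, BANDWIDTHS_GBS)
for name, gflops in heights.items():
    print(f"{name}: {gflops:.0f} GFLOPS")
```

With these assumed numbers, the same call at 10 FLOP/byte would return the compute peak for the L3 and SLM roofs, because their diagonal lines have already crossed the horizontal line at that intensity.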
Explore Kernel Performance at Different Memory Levels
- Expand the filter pane in the GPU Roofline chart toolbar.
- In the Memory Level section, select the memory levels you want to see metrics for. By default, GPU Roofline reports data for GTI memory level (for integrated graphics) and HBM/DRAM memory level (for discrete graphics).
- CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
- L3: Data transferred directly between execution units and L3 cache.
- SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
- HBM: The maximum bandwidth of HBM memory available to your current graphics hardware. The HBM roof is measured using an optimized sequence of load operations, iterating over an array that does not fit in discrete GPU caches.
- GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory on integrated GPUs. Use this to get a sense of external memory traffic.
- DRAM: The maximum bandwidth of DRAM memory available to your current graphics hardware. The DRAM roof is measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches. This roof represents the maximum bandwidth between the GPU, chip uncore (LLC), and main memory on discrete GPUs.
- Spends less time on transferring data between L3 and CARM memory levels
- Uses as much data as possible for actual calculations
- Reduces the elapsed time of the kernel and of the entire application
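Because each memory level sees a different amount of traffic, the same kernel has a different arithmetic intensity at each level. The sketch below shows that arithmetic under stated assumptions: the FLOP count and byte counts are made-up values, and `intensity_by_level` is a hypothetical helper, not part of Intel® Advisor.

```python
def intensity_by_level(flop_count, bytes_by_level):
    """Arithmetic intensity (FLOP/byte) of one kernel at each memory level."""
    return {
        level: flop_count / nbytes
        for level, nbytes in bytes_by_level.items()
        if nbytes > 0  # skip levels with no recorded traffic
    }


# Hypothetical measurements for a single kernel.
FLOP_COUNT = 8.0e9
TRAFFIC_BYTES = {"CARM": 4.0e9, "L3": 2.0e9, "GTI": 0.5e9}

intensity = intensity_by_level(FLOP_COUNT, TRAFFIC_BYTES)

# Ratio of L3 traffic to CARM traffic: one simple way to compare
# how much data moves at two adjacent levels.
l3_to_carm = TRAFFIC_BYTES["L3"] / TRAFFIC_BYTES["CARM"]

for level, value in intensity.items():
    print(f"{level}: {value:.1f} FLOP/byte")
print(f"L3/CARM traffic ratio: {l3_to_carm:.2f}")
```

On the chart, this is why one kernel can appear as several dots at different horizontal positions, one for each selected memory level.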
Identify Hotspots and Estimate Room for Optimization
- Small green dots represent kernels with relatively small execution time (0-1 second).
- Medium-sized yellow dots represent kernels with medium-range execution time (1-20 seconds).
- Large red dots represent kernels with the largest execution time (20-100 seconds).
- Their size clearly shows that improving self elapsed time for these kernels has a significant impact on the total time of the program.
- Their location shows that there is significant headroom for optimization.
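One way to combine dot size (elapsed time) and dot location (distance to the roof) into a single priority is sketched below. It is a rough estimate, not the tool's own metric: it assumes the amount of work is fixed, so elapsed time scales inversely with achieved performance. The kernel names and numbers are hypothetical.

```python
def headroom(achieved_gflops, roof_gflops):
    """How many times faster the kernel would run if it reached the roof."""
    return roof_gflops / achieved_gflops


def potential_time_saved(elapsed_s, achieved_gflops, roof_gflops):
    """Seconds saved if the kernel reached the roof (fixed-work assumption)."""
    return elapsed_s * (1.0 - achieved_gflops / roof_gflops)


# (name, elapsed seconds, achieved GFLOPS, limiting roof GFLOPS); all hypothetical.
kernels = [
    ("kernel_a", 40.0, 50.0, 400.0),   # large dot, far below the roof
    ("kernel_b", 5.0, 300.0, 400.0),   # medium dot, close to the roof
    ("kernel_c", 0.5, 20.0, 400.0),    # small dot, far below the roof
]

# Rank kernels by the time that optimization could recover.
ranked = sorted(
    kernels,
    key=lambda k: potential_time_saved(k[1], k[2], k[3]),
    reverse=True,
)
for name, elapsed, achieved, roof in ranked:
    saved = potential_time_saved(elapsed, achieved, roof)
    print(f"{name}: headroom {headroom(achieved, roof):.1f}x, up to {saved:.2f} s saved")
```

In this example the small dot has the largest headroom factor, but the large dot dominates the ranking because it accounts for most of the elapsed time.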
Define If Your Kernel Is Compute or Memory Bound
- Guidance on possible optimization steps depending on the factor limiting performance. Click on the bounding factor to expand the hint.
- Amount of data transferred for each cache memory level.
- The exact roof that limits the kernel performance. The arrow points to what you should optimize the kernel for and shows the potential speedup after the optimization in the callout. If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound. Intel® Advisor displays a compute roof limiting the performance of your kernel based on the instruction mix used.
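The diagonal-versus-horizontal distinction can be expressed with the ridge point of the roofline model, the arithmetic intensity at which a memory roof meets the compute roof. The following is a simplified sketch with a single memory roof and a single compute roof. It does not reproduce how Intel® Advisor selects the limiting roof, and `classify` and its inputs are hypothetical.

```python
def classify(arithmetic_intensity, achieved_gflops, peak_gflops, bandwidth_gbs):
    """Return (bound, potential_speedup) for a kernel under one pair of roofs."""
    # Intensity at which the diagonal roof meets the horizontal roof.
    ridge_point = peak_gflops / bandwidth_gbs
    if arithmetic_intensity < ridge_point:
        # Left of the ridge: the diagonal (memory) roof is the lower ceiling.
        bound = "memory"
        roof_gflops = arithmetic_intensity * bandwidth_gbs
    else:
        # At or right of the ridge: the horizontal (compute) roof is the ceiling.
        bound = "compute"
        roof_gflops = peak_gflops
    return bound, roof_gflops / achieved_gflops


# Hypothetical kernels on hardware with a 1000 GFLOPS peak and 100 GB/s bandwidth.
low_intensity = classify(2.0, 50.0, 1000.0, 100.0)
high_intensity = classify(20.0, 500.0, 1000.0, 100.0)
print(low_intensity)
print(high_intensity)
```

The second value in each result plays the role of the speedup shown in the callout: the ratio between the limiting roof and the performance the kernel currently achieves.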
Investigate Performance of Kernel Instances
- In the GPU Roofline chart:
- Click a dot on a Roofline chart and click the + button that appears next to the dot. The dot expands into several dots representing the instances of the selected kernel.
- Click a dot representing a kernel instance and view details about its global and local work size in the GPU Details pane.
- Hover over dots representing kernel instances to review and compare their performance metrics. Highlight a roofline limiting the performance of a given instance by double-clicking the dot.
- In the GPU pane grid:
- Expand a source kernel in the Compute Task column.
- View the information about the work size of the kernel instances by expanding the Work Size column in the grid. To view the count of instances of a given global/local size, expand the Compute Task Details column in the grid and notice the Instance Count metric.
- Compare performance metrics for instances of different global and local size using the grid and the GPU Details pane.
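The grouping behind an instance count can be pictured as follows. This sketch does not read Intel® Advisor data or use any Advisor API. The records, field names, and kernel name are hypothetical, standing in for values you might copy from the grid.

```python
from collections import Counter

# Hypothetical kernel instances: each has a global and a local work size.
instances = [
    {"kernel": "matmul", "global_size": (1024, 1024), "local_size": (16, 16)},
    {"kernel": "matmul", "global_size": (1024, 1024), "local_size": (16, 16)},
    {"kernel": "matmul", "global_size": (512, 512), "local_size": (8, 8)},
    {"kernel": "matmul", "global_size": (1024, 1024), "local_size": (8, 8)},
]

# Count how many instances share each (global, local) work size pair.
instance_count = Counter(
    (inst["global_size"], inst["local_size"]) for inst in instances
)

for (global_size, local_size), count in instance_count.most_common():
    print(f"global={global_size} local={local_size}: {count} instance(s)")
```

Grouping by the pair rather than by global size alone keeps instances with the same problem size but different work-group shapes apart, which is the comparison the last step above is after.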