Intel® Advisor User Guide

ID 766448
Date 10/31/2024
Public

Examine Bottlenecks on GPU Roofline Chart

The GPU Roofline Insights perspective enables you to view your application performance in relation to the maximum capabilities of your hardware, plotted on a Roofline chart. The chart is available in the GPU Roofline Regions view.

NOTE:
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.

Explore Performance-Limiting Factors

Intel® Advisor visualizes the maximum compute capacity and maximum memory bandwidth of your hardware on a Roofline chart:

  • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
  • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
    • L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
    • SLM cache roof: Shared local memory (SLM). Represents the maximal bandwidth of the SLM for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
    • GTI roof: Graphics technology interface (GTI). Represents the maximum bandwidth between the GPU and the rest of the system on a chip (SoC). This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
    • DRAM roof: Dynamic random-access memory (DRAM). Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
    • HBM roof: High bandwidth memory (HBM). Represents the maximum bandwidth of HBM memory available to your current graphics hardware. Measured using an optimized sequence of load operations iterating over an array that does not fit in discrete GPU caches.
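The way horizontal and diagonal roofs bound a kernel can be sketched with the general Roofline model, in which attainable performance is the smaller of the compute peak and arithmetic intensity multiplied by memory bandwidth. This is a minimal illustration of that model, not Intel Advisor's implementation; the function names and the peak and bandwidth numbers are hypothetical.

```python
def roof_limit(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOPS for one compute roof and one memory roof.

    arithmetic_intensity: FLOP per byte moved at the chosen memory level.
    peak_gflops: height of the horizontal (compute) roof.
    bandwidth_gbs: slope of the diagonal (memory) roof, in GB/s.
    """
    memory_bound = arithmetic_intensity * bandwidth_gbs
    return min(peak_gflops, memory_bound)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the diagonal roof meets the horizontal roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical hardware: 1000 GFLOPS compute peak, 100 GB/s memory roof.
PEAK = 1000.0
BANDWIDTH = 100.0

low_ai = roof_limit(0.5, PEAK, BANDWIDTH)    # left of the ridge: limited by the diagonal roof
high_ai = roof_limit(20.0, PEAK, BANDWIDTH)  # right of the ridge: limited by the horizontal roof
```

Under these assumed numbers, a kernel left of the ridge point is capped by the memory roof, and a kernel right of it is capped by the compute roof.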

Identify Hotspots and Estimate Room for Optimization

According to Amdahl’s law, optimizing kernels that take the largest portion of the total program time leads to greater speedups than optimizing kernels that take a smaller portion. Intel Advisor enables you to identify the kernels that take the largest portion of the total time as hotspots. To find the best candidates for optimization, look at the dots on the Roofline chart. Each dot corresponds to a kernel running on the GPU. The size and color of a dot depend on its weight (point weight), which is the percentage of the program total time spent in that dot and is calculated as dot self elapsed time / program total elapsed time * 100. By default, dots have the following sizes and colors:

  • Small green dots represent kernels with relatively small execution time (0-1% of the total time).
  • Medium-sized yellow dots represent kernels with medium-range execution time (1-20% of the total time).
  • Large red dots represent kernels with the largest execution time (20-100% of the total time).
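The weight formula and the default ranges above can be sketched as follows. The function names are hypothetical, and the handling of the exact boundary values (1% and 20%) is an assumption, because the ranges listed above overlap at their endpoints.

```python
def dot_weight(self_elapsed_time, total_elapsed_time):
    """Dot weight in percent: self elapsed time / total elapsed time * 100."""
    return self_elapsed_time / total_elapsed_time * 100


def default_dot_style(weight_percent):
    """Map a weight to the default (size, color) pair.

    Assumption: a value exactly on a boundary falls into the higher range.
    """
    if weight_percent < 1:
        return ("small", "green")
    if weight_percent < 20:
        return ("medium", "yellow")
    return ("large", "red")


# Hypothetical kernel: 2 s of self elapsed time in a program that runs for 8 s.
weight = dot_weight(2.0, 8.0)
style = default_dot_style(weight)
```

With these hypothetical timings, the kernel accounts for a quarter of the program time and falls into the largest range.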

NOTE:
To customize the dot execution time range, size and color, click the button on the Roofline chart to open the Loop Weight Representation menu.

The best candidates for optimization are the largest dots (red ones by default) located far below the topmost rooflines because:

  • Their size clearly shows that improving self elapsed time for these kernels has a significant impact on the total time of the program.
  • Their location shows that there is significant headroom for optimization.

To identify optimization headroom for a specific kernel, double-click a dot on the chart to highlight the roof that limits its performance. The roofs above the dot represent the restrictions preventing it from achieving a higher performance. The dot cannot exceed the topmost rooflines, as they represent the maximum capabilities of the hardware. The farther the dot is from the topmost roofs, the more room for improvement there is.

Hover over the selected dot to view its projection on the limiting roof and the estimated speedup that can be achieved by optimizing this kernel.
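One way to reason about headroom is to compare the achieved performance of a kernel with the height of the limiting roof at the same arithmetic intensity, and then use Amdahl's law to see what that kernel-level gain means for the whole program. This is a sketch under the general Roofline model with hypothetical names and numbers; the speedup that Intel Advisor reports may be computed differently.

```python
def kernel_headroom(achieved_gflops, arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Ratio between the roof height at the kernel's arithmetic intensity and its achieved GFLOPS."""
    roof = min(peak_gflops, arithmetic_intensity * bandwidth_gbs)
    return roof / achieved_gflops


def amdahl_speedup(time_fraction, kernel_speedup):
    """Whole-program speedup when a kernel taking time_fraction of the run gets kernel_speedup times faster."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / kernel_speedup)


# Hypothetical kernel: 50 GFLOPS achieved at 0.5 FLOP/byte,
# on hardware with a 1000 GFLOPS compute roof and a 200 GB/s memory roof.
headroom = kernel_headroom(50.0, 0.5, 1000.0, 200.0)

# If that kernel takes half of the program time, the program-level gain is smaller.
program_gain = amdahl_speedup(0.5, headroom)
```

The second function shows why dot size matters as much as distance from the roof: the same kernel-level headroom yields a smaller program-level gain when the kernel's share of the total time is small.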

A similar approach is used for multi-tile GPUs, where the Roofline chart depicts each GPU tile. For example, for a multi-tile GPU with two tiles, there are two dots on the Roofline chart (one dot per tile). If the tiles perform equally, the dots are in the same place on the chart or very close to each other. If there is a distance between the dots, consider the following:

  • A dot on the left indicates that this tile is experiencing memory bandwidth limitations.
  • A dot on the right indicates that this tile is experiencing compute capacity limitations.

To view the details on each tile, expand the hotspot. You can, for example, switch to the Source and Assembly view and examine the detailed information for the GPU tile and GPU device.

Based on this analysis, you may want to correct the unbalanced operation so that the dots for all GPU tiles are in the central zone of the chart, which indicates that the tiles perform more efficiently.

Explore Kernel Performance at Different Memory Levels

By default, Intel Advisor collects data for all memory levels. This enables you to examine each kernel at different cache levels and arithmetic intensities and provides precise insights into which cache level causes the performance bottlenecks.

Configure Memory-Level Roofline Chart

  1. Expand the filter pane in the GPU Roofline chart toolbar.
  2. In the Memory Level section, select the memory levels you want to see metrics for.

    NOTE:
    By default, GPU Roofline reports data for GTI memory level (for integrated graphics) and HBM/DRAM memory level (for discrete graphics).
  3. Click Apply.

Interpret Memory-Level GPU Roofline Data

Double-click a dot on the chart to review and compare the changes in traffic between the displayed memory levels, identify a memory hierarchy bottleneck, and highlight the roof that limits your kernel performance the most. You can use this information to determine optimization steps. The chart displays labeled dots and/or X marks that show the arithmetic intensity of the selected kernel at the following memory levels:

  • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
  • L3: Data transferred directly between execution units and L3 cache.
  • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
  • HBM: The maximum bandwidth of HBM memory available to your current graphics hardware. The HBM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in discrete GPU caches.
  • GTI: GPU memory read bandwidth, that is, the traffic between the GPU, chip uncore (LLC), and main memory on integrated GPUs. Use this to understand external memory traffic.
  • DRAM: Maximum DRAM memory bandwidth available to your current GPU. The DRAM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in GPU caches. This roof represents the maximum bandwidth between the GPU, chip uncore (LLC), and main memory on discrete GPUs.
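The reason a single kernel appears at several horizontal positions can be sketched with the usual definition of arithmetic intensity, the number of floating-point operations divided by the bytes moved at a given memory level. The function name and the traffic figures below are hypothetical and are not taken from an Intel Advisor report.

```python
def arithmetic_intensity_by_level(total_flop, bytes_by_level):
    """Return FLOP/byte for each memory level, given the bytes moved at that level."""
    return {level: total_flop / traffic for level, traffic in bytes_by_level.items()}


# Hypothetical kernel: 1000 FLOP in total, with less traffic at each level farther from the EUs.
traffic = {"CARM": 500.0, "L3": 250.0, "DRAM": 100.0}
intensity = arithmetic_intensity_by_level(1000.0, traffic)
```

In this hypothetical case, the level that sees the fewest bytes has the highest arithmetic intensity, so its dot sits farthest to the right on the chart.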