Examine Bottlenecks on GPU Roofline Chart

The GPU Roofline Insights perspective enables you to view your application performance in relation to the maximum capabilities of your hardware plotted on a Roofline chart, which is available in the GPU Roofline Regions view.
Example of a GPU Roofline chart

Explore Performance-Limiting Factors

Intel® Advisor visualizes the maximum compute capacity and maximum memory bandwidth of your hardware on a Roofline chart:
  • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
  • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
    • L3 cache roof: Represents the maximum bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
    • SLM cache roof: Represents the maximum bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
    • GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated using an analytical formula based on the maximum frequency of your current graphics hardware.
    • DRAM roof: Represents the maximum bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
    • HBM roof: Represents the maximum bandwidth of HBM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in discrete GPU caches.
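As a rough illustration of how these roofs bound a kernel, the sketch below applies the standard Roofline formula: the ceiling at a given arithmetic intensity is the smaller of the compute peak and the memory bandwidth multiplied by that intensity. The peak values and function names are hypothetical placeholders, not measurements of any GPU and not part of any Intel Advisor API.

```python
# Hypothetical peaks chosen for illustration only.
COMPUTE_PEAK_GFLOPS = 1000.0  # horizontal roof
MEMORY_ROOFS_GBS = {          # diagonal roofs, in GB/s
    "L3": 200.0,
    "SLM": 150.0,
    "GTI": 50.0,
    "DRAM": 30.0,
}


def attainable_gflops(ai, compute_peak_gflops, bandwidth_gbs):
    """Roofline ceiling at arithmetic intensity `ai` (operations per byte)."""
    return min(compute_peak_gflops, ai * bandwidth_gbs)


def ceilings(ai):
    """Ceiling imposed by each memory roof at the given arithmetic intensity."""
    return {
        name: attainable_gflops(ai, COMPUTE_PEAK_GFLOPS, bandwidth)
        for name, bandwidth in MEMORY_ROOFS_GBS.items()
    }


if __name__ == "__main__":
    for name, value in ceilings(2.0).items():
        print(f"{name}: {value} GFLOPS")
```

At low arithmetic intensity the diagonal term is the smaller one, so the memory roof caps performance. At high intensity every ceiling flattens to the compute peak.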

Explore Kernel Performance at Different Memory Levels

By default, Intel Advisor collects data for all memory levels. This enables you to examine each kernel at different cache levels and arithmetic intensities and provides precise insights into which cache level causes the performance bottlenecks.
Configure Memory-Level Roofline Chart
  1. Expand the filter pane in the GPU Roofline chart toolbar.
  2. In the Memory Level section, select the memory levels you want to see metrics for.
    Select memory levels for a GPU Roofline chart
    By default, GPU Roofline reports data for the GTI memory level (for integrated graphics) and the HBM/DRAM memory level (for discrete graphics).
  3. Click Apply.
Interpret Memory-Level GPU Roofline Data
Double-click a dot on the GPU Roofline chart to examine the relationships between the displayed memory levels and to highlight the roof that limits your kernel performance the most. Labeled dots and/or X marks appear, showing the arithmetic intensity of the selected kernel at the following memory levels:
  • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and the corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
  • L3: Data transferred directly between execution units and L3 cache.
  • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
  • HBM: Represents the maximum bandwidth of HBM memory available to your current graphics hardware. The HBM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in discrete GPU caches.
  • GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory on integrated GPUs. Use this to get a sense of external memory traffic.
  • DRAM: Represents the maximum bandwidth of DRAM memory available to your current graphics hardware. The DRAM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in GPU caches. This roof represents the maximum bandwidth between the GPU, chip uncore (LLC), and main memory on discrete GPUs.
The vertical distance between memory dots and their respective rooflines shows how much you are limited by a given memory subsystem. If a dot is close to its roofline, the kernel is limited by the bandwidth of this memory level.
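One way to put a number on the vertical distance is the ratio of a dot's performance to its memory roof at the same arithmetic intensity. Algebraically, this ratio equals the achieved bandwidth at that memory level divided by the peak bandwidth. The sketch below is a minimal illustration of that arithmetic. The function name and all input values are hypothetical, and the result is not a metric reported by Intel Advisor.

```python
def roof_utilization(operations, bytes_at_level, seconds, peak_bandwidth):
    """Fraction of a memory roof reached by a kernel (1.0 means on the roof).

    operations      total FLOP or INTOP executed by the kernel
    bytes_at_level  traffic at the memory level of interest, in bytes
    seconds         kernel elapsed time
    peak_bandwidth  roof bandwidth of that memory level, in bytes per second
    """
    arithmetic_intensity = operations / bytes_at_level  # x coordinate of the dot
    achieved = operations / seconds                     # y coordinate of the dot
    roof = arithmetic_intensity * peak_bandwidth        # roof height at that x
    return achieved / roof


if __name__ == "__main__":
    # Hypothetical kernel: 4e9 operations, 2e9 bytes of traffic, 0.5 s,
    # measured against a hypothetical 8e9 bytes/s roof.
    print(roof_utilization(4e9, 2e9, 0.5, 8e9))
```

A value near 1.0 corresponds to a dot sitting close to its roofline. A small value corresponds to a dot far below it.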
The horizontal distance between memory dots indicates how efficiently the kernel uses cache. For example, if L3 and GTI dots are very close on the horizontal axis for a single kernel, the kernel uses L3 and GTI similarly. This means that it does not use L3 and GTI efficiently. Improve data reuse in the code to improve application performance.
Arithmetic intensity on the x axis determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the CARM dot is typically far to the right of the L3 dot, because L3 read/write access is done by whole cache lines, while CARM traffic is the sum of the actual bytes used in operations. To identify room for optimization, check the L3 cache line utilization metric for a given kernel. If the L3 cache line is not utilized well enough, check memory access patterns in your kernel to improve its elapsed time.
Ideally, the CARM and the L3 dots should be located close to each other, and the GTI dot should be far to the right of them. In this case, the kernel has good memory access patterns and mostly utilizes the L3 cache. If the kernel utilizes the L3 cache line well, it:
  • Spends less time on transferring data between L3 and CARM memory levels
  • Uses as much data as possible for actual calculations
  • Reduces the elapsed time of the kernel and of the entire application
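The horizontal placement described above follows from how arithmetic intensity is computed: the same operation count divided by the traffic observed at each memory level. The following sketch uses made-up traffic figures for a single kernel to show how the left-to-right order of the dots falls out of that division. The numbers and function names are assumptions for illustration only.

```python
def arithmetic_intensities(operations, traffic_bytes):
    """Arithmetic intensity (operations per byte) at each memory level."""
    return {level: operations / nbytes for level, nbytes in traffic_bytes.items()}


def left_to_right(intensities):
    """Memory levels in the order their dots appear along the x axis."""
    return sorted(intensities, key=intensities.get)


# Hypothetical traffic for one kernel, in bytes. L3 traffic is slightly higher
# than CARM traffic because whole cache lines are moved, and GTI traffic is
# much lower because most requests are served from L3.
traffic = {"CARM": 4e9, "L3": 5e9, "GTI": 0.5e9}
intensities = arithmetic_intensities(20e9, traffic)
order = left_to_right(intensities)

if __name__ == "__main__":
    print(intensities)
    print(order)
```

With these inputs the L3 and CARM dots land close together and the GTI dot lands far to the right, which is the favorable pattern described above.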
Double-click a point on the chart to examine how the kernel uses different memory levels and to view a projection to the memory level that limits your kernel the most.
Example of a GPU Roofline chart for all memory levels
Review and compare the changes in traffic from one memory level to another to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.

Identify Hotspots and Estimate Room for Optimization

According to Amdahl’s law, optimizing kernels that take the largest portion of the total program time leads to greater speedups than optimizing kernels that take a smaller portion of the total time. Intel Advisor enables you to identify kernels taking the largest portion of the total time as hotspots. To find the best candidates for optimization, examine the dots on the Roofline chart, which correspond to kernels running on the GPU. By default, the size and color of the dots represent the following:
  • Small green dots represent kernels with relatively small execution time (0-1 second).
  • Medium-sized yellow dots represent kernels with medium-range execution time (1-20 seconds).
  • Large red dots represent kernels with the largest execution time (20-100 seconds).
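The default mapping above can be written as a small lookup. This is only a sketch of the documented ranges, not Intel Advisor code. The function name is hypothetical, and the handling of boundary values (exactly 1 or 20 seconds) and of times above 100 seconds is an assumption, since the ranges listed above do not specify it.

```python
def default_dot_style(seconds):
    """Return (size, color) for a kernel with the given execution time."""
    if seconds < 0:
        raise ValueError("execution time cannot be negative")
    if seconds < 1:       # 0-1 second
        return ("small", "green")
    if seconds < 20:      # 1-20 seconds
        return ("medium", "yellow")
    # 20-100 seconds. Treating longer times as large red dots is an assumption.
    return ("large", "red")


if __name__ == "__main__":
    for t in (0.3, 5.0, 42.0):
        print(t, default_dot_style(t))
```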
You can customize the dot execution time range, size and color in the Loop Weight Representation menu, which you can open by clicking the button on the Roofline chart.
The best candidates for optimization are the largest dots (red ones by default) located far below the topmost rooflines because:
  • Their size clearly shows that improving self elapsed time for these kernels has a significant impact on the total time of the program.
  • Their location shows that there is significant headroom for optimization.
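To see why dot size matters, Amdahl's law can be evaluated directly: if a kernel accounts for a fraction p of total time and is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). The sketch below compares two hypothetical kernels that receive the same 2x improvement. The fractions are illustrative values, not data from any profile.

```python
def overall_speedup(time_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup from accelerating one kernel.

    time_fraction   share of total program time spent in the kernel (0 to 1)
    kernel_speedup  factor by which the kernel itself becomes faster
    """
    return 1.0 / ((1.0 - time_fraction) + time_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical hotspot (60% of total time) versus a minor kernel (5%).
    print("hotspot:", overall_speedup(0.60, 2.0))
    print("minor kernel:", overall_speedup(0.05, 2.0))
```

The same kernel-level improvement yields a much larger program-level gain for the hotspot, which is why large dots far below the roofs are the first candidates.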
To identify optimization headroom for a specific kernel, highlight the roof that limits the performance of your kernel by double-clicking a dot on the chart. The roofs above a dot represent the restrictions preventing it from achieving higher performance. A dot cannot exceed the topmost rooflines, as they represent the maximum capabilities of the hardware. The farther a dot is from the topmost roofs, the more room for improvement there is.
Hover over a selected dot to view its projection on the limiting roof and the estimated speedup that can be achieved by optimizing this kernel.

Determine Whether Your Kernel Is Compute or Memory Bound

To determine whether your selected kernel is compute or memory bound, examine the Roofline chart for the selected kernel with the following data in the Roofline Guidance section in the GPU Details tab:
  • Guidance on possible optimization steps depending on the factor limiting performance. Click on the bounding factor to expand the hint.
  • Amount of data transferred for each cache memory level.
  • The exact roof that limits the kernel performance. The arrow points to what you should optimize the kernel for and shows the potential speedup after the optimization in the callout.
    If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound.
    Intel® Advisor displays a compute roof limiting the performance of your kernel based on the instruction mix used.
The chart is plotted for the dominant type of operations in the code (FLOAT or INT) and shows only roofs for the cache memory levels, data types, and instruction mix used in the kernel. If there is no FLOP or INTOP in the kernel, the single-kernel Roofline chart is not shown.
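The diagonal-versus-horizontal rule can be sketched with the textbook Roofline comparison: at the kernel's arithmetic intensity, whichever ceiling is lower is the limiting one, and the ratio of that ceiling to the achieved performance gives an upper bound on speedup. This is a simplified illustration under that assumption. It is not necessarily how Intel Advisor selects the limiting roof, and all names and numbers are hypothetical.

```python
def limiting_roof(ai, achieved, compute_peak, bandwidth):
    """Classify a kernel and estimate its headroom.

    ai            arithmetic intensity, operations per byte
    achieved      measured performance, operations per second
    compute_peak  horizontal roof, operations per second
    bandwidth     diagonal roof, bytes per second
    Returns (bound, potential_speedup).
    """
    memory_ceiling = ai * bandwidth
    if memory_ceiling < compute_peak:
        bound, ceiling = "memory", memory_ceiling   # diagonal roof is lower
    else:
        bound, ceiling = "compute", compute_peak    # horizontal roof is lower
    return bound, ceiling / achieved


if __name__ == "__main__":
    # Hypothetical kernels measured against a 1000 GFLOPS peak and a 200 GB/s roof.
    print(limiting_roof(2.0, 100.0, 1000.0, 200.0))
    print(limiting_roof(10.0, 500.0, 1000.0, 200.0))
```

The intensity at which the two ceilings meet (compute_peak / bandwidth) is the ridge point. Kernels to its left are memory bound under this model, and kernels to its right are compute bound.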
For example, in the screenshot below, the kernel is memory bound. Its performance is limited by the L3 bandwidth because the kernel uses this memory level to transfer the largest amount of data (6.88 GB) compared to other memory levels. If you optimize the memory access patterns in the kernel, it can get up to a 5.1x speedup.
Single-kernel GPU Roofline points to the Int32 Vector Add Peak that bounds the kernel performance

Investigate Performance of Kernel Instances

Review and compare the performance of instances of your kernel initialized with different global and local work sizes. You can do this using the following:
  • In the GPU Roofline chart:
    1. Click a dot on a Roofline chart and click the + button that appears next to the dot. The dot expands into several dots representing the instances of the selected kernel.
    2. Click a dot representing a kernel instance and view details about its global and local work size in the GPU Details pane.
    3. Hover over dots representing kernel instances to review and compare their performance metrics. Highlight a roofline limiting the performance of a given instance by double-clicking the dot.
  • In the GPU pane grid:
    1. Expand a source kernel in the Compute Task column.
    2. View the information about the work size of the kernel instances by expanding the Work Size column in the grid. To view the count of instances of a given global/local size, expand the Compute Task Details column in the grid and notice the Instance Count metric.
    3. Compare performance metrics for instances of different global and local size using the grid and the GPU Details pane.
Selecting a dot on a chart automatically highlights the respective kernel in the grid and vice versa.
You can add the CPU Roofline panes to the main view using the button on the top pane. For details about CPU Roofline data, see CPU / Memory Roofline Insights Perspective.

Next Steps

Explore detailed information about each kernel and get actionable recommendations for optimization of the kernel code using the GPU Details tab.
