Get Started Guide

  • 2022.0
  • 12/06/2021
  • Public Content

Measure GPU Performance Using GPU Roofline

The GPU Roofline Insights perspective enables you to estimate and visualize the actual performance of GPU kernels against hardware-imposed performance ceilings, using benchmarks and hardware metric profiling, and to determine the main limiting factor.
There are two ways to run the GPU Roofline Insights perspective: from the Intel® Advisor GUI and from the command line interface (CLI). Intel Advisor enables you to open results collected with either method in the GUI.
GPU Roofline Insights Perspective from Intel® Advisor
In the Analysis Workflow pane, use the drop-down menu to select the GPU Roofline Insights perspective, set the data collection accuracy level, and click the button to run it. At this accuracy level, Intel Advisor:
  • Measures the hardware limitations and collects timings and memory data for OpenCL™, OpenMP*, oneAPI Level Zero (Level Zero), and Data Parallel C++ (DPC++) kernels using the Survey analysis with GPU profiling.
  • Collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.
For details about data collection accuracy presets, see Intel Advisor User Guide: GPU Roofline Accuracy Presets. Upon completion, Intel Advisor displays a GPU Roofline Summary. Switch to the GPU Roofline Regions tab to view the Roofline Chart and identify the main factors limiting the performance of your application.
GPU profiling is applicable only to Intel® Processor Graphics.
The Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:
  • Arithmetic intensity (x axis) - based on the kernel algorithm and measured in floating-point operations (FLOPS) per byte transferred between the GPU and memory for the FLOAT Roofline chart, and in integer operations (INTOPS) per byte for the INT Roofline chart
  • Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) for the FLOAT Roofline chart and in billions of integer operations per second (GINTOPS) for the INT Roofline chart
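As a rough illustration of the two axes, the following Python sketch computes both coordinates of a single dot from raw counts. The function names and the kernel numbers are hypothetical and chosen only for this example; they are not part of Intel Advisor.

```python
def arithmetic_intensity(operations: float, bytes_transferred: float) -> float:
    """Return operations per byte, the x-axis value of a roofline dot."""
    if bytes_transferred <= 0:
        raise ValueError("bytes_transferred must be positive")
    return operations / bytes_transferred


def giga_ops_per_second(operations: float, seconds: float) -> float:
    """Return billions of operations per second, the y-axis value of a dot."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations / seconds / 1e9


# Hypothetical kernel: 2e9 floating-point operations, 4e9 bytes moved, 0.5 s.
ai = arithmetic_intensity(2e9, 4e9)
gflops = giga_ops_per_second(2e9, 0.5)
print(f"arithmetic intensity = {ai} FLOP/byte, performance = {gflops} GFLOPS")
```

The same two functions apply to the INT Roofline chart if integer operation counts are passed instead of floating-point counts.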
In general:
  • Dots represent kernels. The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so they are the best candidates for optimization. Small green dots take less time, so they may not be worth optimizing.
  • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization.
    Depending on your system configuration, the following rooflines might be available on the Roofline chart:
    • L3 cache roof: Represents the maximum bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
    • SLM cache roof: Represents the maximum bandwidth of the Shared Local Memory (SLM) for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
    • GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated using an analytical formula based on the maximum frequency of your current graphics hardware.
    • DRAM roof: Represents the maximum bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
  • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
  • The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
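The relationship between dots and rooflines described in the list above can be sketched with the classic roofline formula, where the attainable performance is the lower of the compute roof and the product of arithmetic intensity and memory bandwidth. This is a minimal Python sketch of that general model, not a description of how Intel Advisor computes its chart; the roof values and kernel numbers are hypothetical.

```python
def attainable_gops(ai: float, peak_gops: float, bandwidth_gb_s: float) -> float:
    """Classic roofline ceiling: min(compute roof, ai * memory bandwidth)."""
    return min(peak_gops, ai * bandwidth_gb_s)


def limiting_factor(ai: float, peak_gops: float, bandwidth_gb_s: float) -> str:
    """Name the roof that caps a kernel with the given arithmetic intensity."""
    if ai * bandwidth_gb_s < peak_gops:
        return "memory bandwidth"  # the diagonal roof is lower at this intensity
    return "compute capacity"      # the horizontal roof is lower


# Hypothetical roofs: 1000 GFLOPS compute peak, 50 GB/s DRAM bandwidth.
PEAK_GFLOPS = 1000.0
DRAM_GB_S = 50.0

# Hypothetical kernel: 0.5 FLOP/byte, achieving 4 GFLOPS.
ceiling = attainable_gops(0.5, PEAK_GFLOPS, DRAM_GB_S)
bound = limiting_factor(0.5, PEAK_GFLOPS, DRAM_GB_S)
headroom = ceiling / 4.0  # how far the dot sits below its roof

print(f"ceiling = {ceiling} GFLOPS ({bound} bound), headroom = {headroom}x")
```

A headroom close to 1 means the dot already sits near its roof; a large value suggests more room for improvement at the current arithmetic intensity.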
The GPU Roofline chart is based on a CPU Roofline chart layout, but there are some differences:
  • The dots on the chart correspond to OpenCL, OpenMP, Level Zero, and DPC++ kernels, while in the CPU version, they correspond to individual loops.
  • Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
  • The GPU Roofline chart enables you to view the arithmetic intensity of one kernel at multiple memory levels. To do so, double-click the dot representing this kernel, or select it and press ENTER. The dots that appear on the Roofline chart correspond to the different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots on the chart, use the Memory Level drop-down filter.
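To see why one kernel produces several dots, note that the operation count stays the same while the traffic differs per memory level. The Python sketch below illustrates this under the assumption that per-level traffic is available in bytes; the function name and all numbers are hypothetical.

```python
def intensity_by_level(operations: float, traffic_bytes: dict[str, float]) -> dict[str, float]:
    """Return operations per byte for every memory level with non-zero traffic."""
    return {
        level: operations / nbytes
        for level, nbytes in traffic_bytes.items()
        if nbytes > 0  # skip levels the kernel did not touch
    }


# Hypothetical kernel: 2e9 operations with different traffic at each level.
levels = intensity_by_level(2e9, {"L3": 8e9, "SLM": 1e9, "GTI": 4e9})
for level, ai in levels.items():
    print(f"{level}: {ai} operations/byte")
```

All resulting dots share the same performance value (y axis) and differ only in arithmetic intensity (x axis).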
Intel Advisor GPU Roofline Chart
GPU Roofline Insights Perspective from Command Line Interface
To run the GPU Roofline Insights perspective using the command line interface, use the following command:
advisor --collect=roofline --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
Alternatively, run the two analyses separately:
  1. Collect performance metrics for loops/functions of your application using the Survey analysis:
    advisor --collect=survey --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
  2. Collect floating-point and integer operations data using the Characterization analysis:
    advisor --collect=tripcounts --no-trip-counts --flop --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
    • --no-trip-counts disables collection of trip counts during the Characterization analysis.
    • --flop enables collection of data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms during the Characterization analysis.
The advisor --collect=roofline command runs in batch mode and performs two analyses one after another:
  1. Survey analysis, which collects execution time data for loops/functions and measures L3, SLM, and GTI traffic.
  2. Characterization analysis, which collects floating-point and integer operations data, taking mask utilization into account, and CARM memory traffic to measure the arithmetic intensity and performance of your application.
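If you script these collections, the command lines shown in this section can be assembled programmatically. The following Python sketch only builds and prints the commands, using the options that appear above; the helper name advisor_collect and its parameters are hypothetical, and the script assumes the advisor executable is on your PATH if you choose to run the commands.

```python
import shlex


def advisor_collect(analysis: str, project_dir: str, app: list[str],
                    extra: tuple[str, ...] = ()) -> list[str]:
    """Build an 'advisor --collect' command line with GPU profiling enabled."""
    return [
        "advisor",
        f"--collect={analysis}",
        *extra,                         # analysis-specific options, if any
        "--profile-gpu",
        f"--project-dir={project_dir}",
        "--search-dir", f"src:p={project_dir}",
        "--",                           # everything after this is the target
        *app,
    ]


app = ["myApplication"]
batch = advisor_collect("roofline", "./advi", app)
survey = advisor_collect("survey", "./advi", app)
characterization = advisor_collect(
    "tripcounts", "./advi", app, extra=("--no-trip-counts", "--flop")
)

for cmd in (batch, survey, characterization):
    # To execute instead of printing: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the command as a list rather than a single string avoids shell quoting problems when the application path or its arguments contain spaces.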
To view the achieved performance of your application against hardware-imposed performance ceilings on an interactive Roofline chart, open the collected results in the Intel Advisor GUI or use the following command to generate an interactive HTML Roofline report:
advisor --report=roofline --profile-gpu --report-output=./advi/advisor-roofline.html --project-dir=./advi
The --report-output option specifies the directory and the HTML file into which Intel Advisor saves the generated report.
By default, Intel Advisor generates a FLOAT Roofline chart. To switch to the INT Roofline chart, add the corresponding option to your command.
For details about generating CLI reports, see the respective section in the Intel Advisor User Guide or use the following command in your terminal:
advisor --help report
Intel Advisor enables you to create a read-only result snapshot using the following command:
advisor --snapshot --project-dir=./advi --pack --cache-sources --cache-binaries -- /tmp/my_proj_snapshot
What's Next
Use the GPU Roofline Summary (available in GUI only) to compare performance of your application on a CPU and on a GPU device.
Investigate performance metrics for your kernels and recommendations with possible optimization steps in GPU Code Analytics.
See Also
Explore a use case for optimizing GPU usage described in Intel Advisor Cookbook: Identify Code Regions to Offload to GPU and Visualize GPU Usage.
