Measure GPU Performance Using GPU Roofline

Get Started with Intel® Advisor

ID 766450

Date 1/18/2023

The GPU Roofline Insights perspective enables you to estimate and visualize the actual performance of GPU kernels against hardware-imposed performance ceilings, using benchmarks and hardware metric profiling, and to determine the main limiting factor.

There are two ways to run the GPU Roofline Insights perspective: from the Intel® Advisor GUI and from the command line interface (CLI). Intel Advisor enables you to open results collected using either method in the GUI.

Run GPU Roofline Insights Perspective from Intel® Advisor GUI

In the Analysis Workflow pane, use a drop-down menu to select the GPU Roofline Insights perspective, set data collection accuracy level to Low, and click the button to run it. At this accuracy level, Intel Advisor:

Measures the hardware limitations and collects OpenCL™, OpenMP*, oneAPI Level Zero (Level Zero) and SYCL kernels timings and memory data using the Survey analysis with GPU profiling.
Collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.

For details about data collection accuracy presets, see Intel Advisor User Guide: GPU Roofline Accuracy Presets. Upon completion, Intel Advisor displays a GPU Roofline Summary. Switch to the GPU Roofline Regions tab to view the Roofline Chart and identify the main factors limiting the performance of your application.

IMPORTANT:

GPU profiling is applicable only to Intel® Processor Graphics.

A Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:

Arithmetic intensity (x axis) - the number of operations the kernel algorithm performs per byte transferred between the GPU and memory, measured in floating-point operations (FLOP) per byte for the FLOAT Roofline chart and in integer operations (INTOP) per byte for the INT Roofline chart
Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) for the FLOAT Roofline chart and in billions of integer operations per second (GINTOPS) for the INT Roofline chart
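
As a rough illustration of how a dot's coordinates follow from these two definitions, the sketch below computes arithmetic intensity and performance for a single kernel using awk. The operation count, byte count, and execution time are made-up values rather than Intel Advisor output, and the function name dot_coords is hypothetical.

```shell
# dot_coords FLOP BYTES SECONDS
# Prints "<FLOP per byte> <GFLOPS>" for one kernel.
# All inputs are illustrative assumptions, not values reported by Intel Advisor.
dot_coords() {
    awk -v flop="$1" -v bytes="$2" -v secs="$3" 'BEGIN {
        ai = flop / bytes             # x axis: operations per byte transferred
        gflops = flop / secs / 1e9    # y axis: billions of operations per second
        printf "%.2f %.2f\n", ai, gflops
    }'
}

# 4e9 floating-point operations, 8e9 bytes moved, 0.05 s of kernel time
dot_coords 4000000000 8000000000 0.05   # prints: 0.50 80.00
```

With these assumed numbers, the kernel would appear at 0.5 FLOP per byte on the x axis and 80 GFLOPS on the y axis of the FLOAT Roofline chart.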

In general:

Dots represent kernels. The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization.
Depending on your system configuration, the following memory rooflines might be available on the Roofline chart:
- L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
- SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
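
The relationship between a dot and the roofs above it can be sketched numerically. In the general roofline model, the attainable performance at a given arithmetic intensity is the lesser of the compute roof and the memory bandwidth roof multiplied by the arithmetic intensity. The sketch below applies that rule with awk; the peak compute and bandwidth figures are assumed values for illustration, not ceilings measured by Intel Advisor, and the function name roof_limit is hypothetical.

```shell
# roof_limit AI ACHIEVED_GFLOPS PEAK_GFLOPS BANDWIDTH_GBS
# Prints the limiting roof type, the attainable GFLOPS at that intensity,
# and the ratio between the attainable and the achieved performance.
roof_limit() {
    awk -v ai="$1" -v achieved="$2" -v peak="$3" -v bw="$4" 'BEGIN {
        mem_roof = ai * bw    # diagonal roof: bandwidth times arithmetic intensity
        if (mem_roof < peak) { limit = "memory";  roof = mem_roof }
        else                 { limit = "compute"; roof = peak }
        printf "%s %.1f %.1fx\n", limit, roof, roof / achieved
    }'
}

# Kernel at 0.5 FLOP/byte achieving 80 GFLOPS, on a device with an assumed
# 1000 GFLOPS compute roof and an assumed 400 GB/s memory roof
roof_limit 0.5 80 1000 400   # prints: memory 200.0 2.5x
```

In this made-up case the kernel sits under a diagonal (memory) roof, and its distance to that roof suggests up to 2.5x headroom before the memory bandwidth limit is reached.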

The GPU Roofline chart is based on a CPU Roofline chart layout, but there are some differences:

The dots on the chart correspond to OpenCL, OpenMP, Level Zero and SYCL kernels, while in the CPU version, they correspond to individual loops.
Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
The GPU Roofline chart enables you to view arithmetic intensity of one kernel at multiple memory levels. To do so, double-click a dot representing this kernel or select it and press ENTER. The dots that appear on the Roofline chart correspond to different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots from a chart, use the Memory Level drop-down filter.

Run GPU Roofline Insights Perspective from Command Line Interface

To run the GPU Roofline Insights perspective using the advisor command line interface, use the following command:

advisor --collect=roofline --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication

To run the two analyses separately instead, first collect performance metrics for loops/functions of your application using Survey analysis:

advisor --collect=survey --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication

Collect floating-point operations data using Characterization analysis:
```
advisor --collect=tripcounts --no-trip-counts --flop --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
```
Where:
- --no-trip-counts disables collection of trip counts during Characterization analysis.
- --flop enables collection of data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms during Characterization analysis.

The advisor --collect=roofline command runs in batch mode and executes two analyses one after the other:

Survey analysis, which collects execution time data for loops/functions and measures L3, SLM, and GTI traffic.
Characterization analysis, which collects floating-point and integer operations data, taking mask utilization into account, and Cache-Aware Roofline Model (CARM) memory traffic to measure the arithmetic intensity and performance of your application.
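
If you run the Survey and Characterization steps separately, a small wrapper script can keep them in order. The following is a minimal sketch that uses only the options shown in this section. By default it only prints the commands; the DRY_RUN variable and the run_step and collect_gpu_roofline helpers are illustrative names, not part of Intel Advisor.

```shell
#!/bin/sh
# Sketch: run the Survey and Characterization steps one after the other.
PROJECT_DIR=./advi    # project directory used in the commands in this section
APP=myApplication     # placeholder application name used in this section

# run_step is a hypothetical helper: it prints the command and executes it
# only when DRY_RUN=0, so the script can be inspected before collecting data.
run_step() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || exit 1    # stop if a collection step fails
    fi
}

collect_gpu_roofline() {
    run_step advisor --collect=survey --profile-gpu \
        --project-dir="$PROJECT_DIR" --search-dir src:p="$PROJECT_DIR" -- "$APP"
    run_step advisor --collect=tripcounts --no-trip-counts --flop --profile-gpu \
        --project-dir="$PROJECT_DIR" --search-dir src:p="$PROJECT_DIR" -- "$APP"
}

collect_gpu_roofline
```

Set DRY_RUN=0 in the environment to actually launch the collections, assuming advisor is on your PATH and the application can be started by the name given in APP.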

To view the achieved performance of your application against hardware-imposed performance ceilings on an interactive Roofline chart, open the collected results in the Intel Advisor GUI or use the following command to generate an interactive HTML Roofline report:

advisor --report=roofline --profile-gpu --report-output=./advi/advisor-roofline.html --project-dir=./advi

Where the --report-output option specifies the directory and the HTML file into which Intel Advisor saves the generated report.

By default, Intel Advisor generates a FLOAT Roofline chart. To switch to the INT Roofline chart, add the --data-type=int option to your command.
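
To keep both chart types side by side, one option is to generate two HTML reports from the same project. The sketch below only prints the two report commands so you can review them first. The report_cmd helper and the advisor-roofline-int.html file name are assumptions made for illustration; the options themselves are the ones described above.

```shell
PROJECT_DIR=./advi    # project directory used in the commands in this section

# report_cmd is a hypothetical helper that prints a roofline report command.
# With no argument it prints the default (FLOAT) command; with the argument
# "int" it appends --data-type=int and writes to a separate HTML file.
report_cmd() {
    out="$PROJECT_DIR/advisor-roofline${1:+-$1}.html"
    echo advisor --report=roofline --profile-gpu \
        --report-output="$out" --project-dir="$PROJECT_DIR" ${1:+--data-type=$1}
}

report_cmd        # FLOAT Roofline report (default data type)
report_cmd int    # INT Roofline report
```

Each printed line is a complete command that you can copy into a terminal once the collection has finished.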

For details about generating CLI reports, see the respective section in the Intel Advisor User Guide or use the following command in your terminal:

advisor --help report

Intel Advisor enables you to create a read-only result snapshot using the following command:

advisor --snapshot --project-dir=./advi --pack --cache-sources --cache-binaries -- /tmp/my_proj_snapshot

What's Next

Use the GPU Roofline Summary (available in GUI only) to compare performance of your application on a CPU and on a GPU device.

Investigate performance metrics for your kernels and recommendations with possible optimization steps in the GPU Code Analytics pane.

See Also

Explore a use case for optimizing GPU usage described in Intel Advisor Cookbook: Identify Code Regions to Offload to GPU and Visualize GPU Usage.
