Intel® Advisor Cookbook

ID 767152
Date 3/31/2023
Public

Analyze a SYCL Application with GPU Roofline

This recipe illustrates how to use the Intel® Advisor command line interface (CLI) to run the GPU Roofline Insights perspective on a SYCL implementation of the Mandelbrot application running on a graphics processing unit (GPU), and how to view the results in command line output, with the Python* API, in the graphical user interface (GUI), and in an interactive HTML report.

Intel Advisor provides the GPU Roofline Insights perspective to evaluate and improve the performance of GPU kernels in SYCL, C++/Fortran with OpenMP* target, Intel® oneAPI Level Zero API (Level Zero), and OpenCL™ applications. Use the GPU Roofline Insights perspective to do the following:

  • Evaluate code executed on a GPU to see how close the performance is to the current hardware-imposed ceilings.
  • Detect and prioritize bottlenecks by estimated performance gain and understand their likely causes, such as memory bound vs compute bound.
  • Pinpoint the exact compute peak or memory level (caches, memory, or compute throughput) causing a bottleneck.
  • Identify which optimizations will pay off the most and apply actionable code restructuring recommendations specific to your application.
  • Visualize optimization progress and compare different code versions plotted on a single Roofline chart.

Directions:

  1. Prerequisites.
  2. Run GPU Roofline Insights perspective.
  3. View GPU Roofline results.
  4. Examine the Application Performance on GPU.
  5. Explore detailed GPU metrics with Intel Advisor Python* API.
  6. Alternative steps.

Ingredients

This section lists the hardware and software used to produce the specific results shown in this recipe:

You can download a precollected GPU Roofline report for the SYCL Mandelbrot application to follow this recipe and examine the analysis results.

Prerequisites

  1. Set up environment. For example:
    source /opt/intel/oneapi/setvars.sh
  2. Compile the sample application:
    cd mandelbrot/ && mkdir build && cd build && cmake .. && make
  3. Configure your system to analyze GPU kernels.

Run GPU Roofline Insights Perspective

To collect GPU Roofline data, run the following command line:

advisor --collect=roofline --profile-gpu --project-dir=./adv_gpu_roofline -- ./src/mandelbrot

After you run the command, Intel Advisor collects data both for GPU kernels and CPU loops/functions in your application to plot a Roofline chart.
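If you script the collection step, for example to rerun it after each code change, the command above can be assembled programmatically. The following is a minimal sketch, not part of Intel Advisor: the helper names build_roofline_command and run_roofline are hypothetical, and only the options shown in the command above are used.

```python
import shutil
import subprocess


def build_roofline_command(project_dir, app_args):
    """Return the GPU Roofline collection command as an argument list."""
    return [
        "advisor",
        "--collect=roofline",
        "--profile-gpu",
        f"--project-dir={project_dir}",
        "--",  # everything after this separator is the target application
        *app_args,
    ]


def run_roofline(project_dir, app_args):
    """Run the collection and return its exit code.

    Returns None when the advisor executable is not on PATH, for example
    because the oneAPI environment script has not been sourced yet.
    """
    if shutil.which("advisor") is None:
        return None
    return subprocess.run(build_roofline_command(project_dir, app_args)).returncode


# Build (but do not run) the command used in this recipe.
command = build_roofline_command("./adv_gpu_roofline", ["./src/mandelbrot"])
print(" ".join(command))
```

Passing the command as a list avoids shell quoting problems if the project directory or application path contains spaces.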

Continue to view the collected results using one of the methods described below.

NOTE:
The shortcut command above does not support MPI applications or applications with specific limitations, such as those where you collect floating-point/integer operations and trip counts data only for certain application parts using collection control APIs. To analyze such applications, run the Survey and Characterization commands separately. See Run GPU Roofline Insights Perspective from Command Line for details.

View GPU Roofline Results

Intel Advisor stores the analysis results and analysis configurations in the adv_gpu_roofline directory specified with the --project-dir option. You can view the collected results in several output formats.

View Results in CLI

After you run the command, the result summary is printed to the terminal. It contains a summary of metrics for the whole application and for its CPU and GPU parts. The information about top GPU hotspots is displayed as a table with execution time, number of operations per second, number of calls, and execution units-related metrics for each GPU hotspot.

To see metrics for all GPU hotspots, run the following report command with the --gpu option:

advisor --report=survey --gpu --project-dir=./adv_gpu_roofline

To view more data columns, add the --show-all-columns option:

advisor --report=survey --gpu --show-all-columns --project-dir=./adv_gpu_roofline

Export Results as an Interactive HTML Report

Generate an interactive HTML report that you can share and view in your web browser:

advisor --report=all --project-dir=./adv_gpu_roofline --report-output=./gpu_roofline_report.html

This command creates an interactive HTML report whose results are structured similarly to those in the GUI. The report contains GPU metrics presented in a grid view and plotted on a Roofline chart. The Roofline chart displays data for floating-point operations, integer operations, and all memory levels that are available in the adv_gpu_roofline results.

View Results in GUI

The easiest way to view results is to open them on the same machine where they were collected if it has Intel Advisor GUI installed. In this case, you open an existing Intel Advisor result without creating any additional files or reports.

To open a result in the GUI, run the following command:

advisor-gui ./adv_gpu_roofline

NOTE:
If the report does not open, click Show Result on the Welcome page, or open the Intel Advisor GUI and select the project from the File > Open > Project menu.

Save a Read-Only Result Snapshot (Instead of HTML and GUI Reports)

If you do not have the Intel Advisor GUI installed on the target machine, copy your results to a shared drive and open them on another machine, or copy the results directly to the client machine.

TIP:
Use read-only snapshots to decrease the size of the copied files.

To create a snapshot and pack it into an archive including sources and binaries, run the following command:

advisor --snapshot --project-dir=./adv_gpu_roofline --pack --cache-sources --cache-binaries -- ./my_snapshot

The snapshot archive named my_snapshot.advixeexpz is saved into the directory specified by the --project-dir option, which is the adv_gpu_roofline directory in this recipe.

To open the result snapshot in Intel Advisor GUI, run the following command:

advisor-gui ./my_snapshot

You can open a snapshot from Intel Advisor GUI by selecting the snapshot in the File > Open > Result menu.

TIP:
Download Intel Advisor for macOS* to view the collected results in the GUI.

For more information, see Snapshot.

Examine the Application Performance on GPU

By default, Intel Advisor displays the Summary tab. It shows general metrics for the whole application and for its GPU and CPU parts. You can also examine the preview Roofline charts for your application.

To get a more detailed per-kernel view, click the compute task of interest in the Top Hotspots pane or click the GPU Roofline Regions tab. This tab contains the GPU Roofline chart on the left side and a detailed overview of the selected kernel with its GPU Source and Assembly views on the right side. At the bottom of the GPU Roofline Regions tab, the GPU pane contains a grid view that lists the kernels and the related raw collected data (including memory-related data, EU active/stalled/idle data, EU thread occupancy, and number of threads). For more information, see Examine Bottlenecks on GPU Roofline Chart.

To see how source lines map to assembly code, right-click the compute task of interest in the GPU table and select View Source, or click the Source View tab.

Explore Detailed GPU Metrics with Intel Advisor Python* API

To view the already collected GPU Roofline results in the CLI, use the Python scripts from /opt/intel/oneapi/advisor/latest/pythonapi/examples. The scripts use the Intel Advisor Python API to print raw metrics that you can post-process with your own scripts. Refer to the examples below.

NOTE:
/opt/intel/oneapi/ is the default Intel Advisor installation directory. Replace it with your installation directory if you installed the Intel Advisor to a different location.

Print GPU Roof Values

Run the gpu_roofs.py script sample to check the values of GPU roofs measured during GPU Roofline collection.

advisor-python /opt/intel/oneapi/advisor/latest/pythonapi/examples/gpu_roofs.py ./adv_gpu_roofline

A list of GPU roofs is printed to the terminal similar to the following:

DP Vector FMA Peak 219 GFLOPS
DP Vector Add Peak 110 GFLOPS
SP Vector FMA Peak 872 GFLOPS
SP Vector Add Peak 439 GFLOPS
Int64 Vector Add Peak 110 GFLOPS
Int32 Vector Add Peak 438 GFLOPS
Int16 Vector Add Peak 873 GFLOPS
Int8 Vector Add Peak 432 GFLOPS
SLM Bandwidth 404 GB/s
L3 Bandwidth 346 GB/s
DRAM Bandwidth 32 GB/s
GTI Bandwidth 76 GB/s
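The printed roofs can be post-processed to find where each memory roof meets a compute roof on the Roofline chart. The sketch below assumes the script output has been captured as text in the format shown above; parse_roofs and ridge_point are hypothetical helper names, not part of the Intel Advisor Python API. The ridge point is the arithmetic intensity (operations per byte) at which a kernel stops being limited by the given memory level and becomes limited by the compute peak.

```python
import re

# Sample output copied from this recipe; in practice, capture the output
# of gpu_roofs.py instead.
SAMPLE_OUTPUT = """\
DP Vector FMA Peak 219 GFLOPS
DP Vector Add Peak 110 GFLOPS
SP Vector FMA Peak 872 GFLOPS
SP Vector Add Peak 439 GFLOPS
Int64 Vector Add Peak 110 GFLOPS
Int32 Vector Add Peak 438 GFLOPS
Int16 Vector Add Peak 873 GFLOPS
Int8 Vector Add Peak 432 GFLOPS
SLM Bandwidth 404 GB/s
L3 Bandwidth 346 GB/s
DRAM Bandwidth 32 GB/s
GTI Bandwidth 76 GB/s
"""

# One roof per line: "<name> <value> <unit>".
ROOF_LINE = re.compile(
    r"^(?P<name>.+?)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>GFLOPS|GB/s)$"
)


def parse_roofs(text):
    """Map each roof name to a (value, unit) tuple."""
    roofs = {}
    for line in text.splitlines():
        match = ROOF_LINE.match(line.strip())
        if match:
            roofs[match["name"]] = (float(match["value"]), match["unit"])
    return roofs


def ridge_point(compute_peak, bandwidth):
    """Operations per byte where the memory roof meets the compute roof."""
    return compute_peak / bandwidth


roofs = parse_roofs(SAMPLE_OUTPUT)
sp_fma_peak = roofs["SP Vector FMA Peak"][0]
for name, (value, unit) in roofs.items():
    if unit == "GB/s":
        print(f"{name}: ridge point at {ridge_point(sp_fma_peak, value):.2f} FLOP/byte")
```

With the sample values, a kernel bounded by DRAM bandwidth needs an arithmetic intensity above 872 / 32 = 27.25 FLOP/byte before the SP Vector FMA peak becomes the limiting roof.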

View Detailed Per-Kernel Metrics

Run the survey_gpu.py script sample to view the detailed list of metrics for each kernel from the pre-collected GPU profile, such as detailed kernel instruction mix.

advisor-python /opt/intel/oneapi/advisor/latest/pythonapi/examples/survey_gpu.py ./adv_gpu_roofline

A list of GPU metrics is printed to the terminal similar to the following:

============================================================
Main GPU Dataset
============================================================
…
============================================================
…
carm_l3_cache_line_utilization_______________: 1
carm_slm_cache_line_utilization______________: 0
carm_traffic_gb______________________________: 0.105906
computing_task_______________________________: MandelParallel::Evaluate(cl::sycl::queue&)::{lambda(cl::sycl::handler&)@235:14}::operator()(cl::sycl::handler&) const::{lambda()@240:44}
computing_task_average_time__________________: 0.000498183
…
elapsed_time_________________________________: 0.0503165
…
gpu_compute_performance_fp_ai________________: 852.558
gpu_compute_performance_gflop________________: 11.7716
gpu_compute_performance_gflops_______________: 233.951
gpu_compute_performance_gintop_______________: 0.617538
gpu_compute_performance_gintops______________: 12.2731
gpu_compute_performance_gmixop_______________: 12.3891
gpu_compute_performance_gmixops______________: 246.224
gpu_compute_performance_int_ai_______________: 44.7253
gpu_compute_performance_mix_ai_______________: 897.283
gpu_memory_bandwidth_gb_sec__________________: 0.27441
gpu_memory_bandwidth_gb_sec_read_____________: 0.233971
gpu_memory_bandwidth_gb_sec_write____________: 0.0404391
gpu_memory_data_transferred_gb_______________: 0.0138074
gpu_memory_data_transferred_gb_read__________: 0.0117726
gpu_memory_data_transferred_gb_write_________: 0.00203475
…
work_size_global_____________________________: 512 x 512
work_size_local______________________________: 256 x 1
============================================================
Instruction Mix Dataset
============================================================
zeCommandListAppendMemoryCopyRegion: 0
============================================================
zeCommandListAppendBarrier: 2
============================================================
MandelParallel::Evaluate(cl::sycl::queue&)::{lambda(cl::sycl::handler&)@235:14}::operator()(cl::sycl::handler&) const::{lambda()@240:44}: 1
    Type: Size: Op Type   : Callcount          : Exec Count         : Dynamic Count
    INT : 32  : MOVE      : 104,403,397        : 1,488,428,112      : 1,351,879,647
    INT : 32  : BIT       : 3,309,568          : 28,131,328         : 28,131,328
    INT : 32  : BASIC     : 190,604,170        : 509,977,482        : 509,977,482
        :     : OTHER     : 165,063,391        : 2,641,014,256      : 2,505,733,341
        :     : CONTROL   : 335,418,273        : 4,110,086,021      : 2,400,317,722
    FP  : 32  : MOVE      : 13,238,272         : 112,525,312        : 112,525,312
    FP  : 32  : MATH      : 3,309,568          : 3,309,568          : 3,309,568
    FP  : 32  : BASIC     : 165,999,156        : 2,655,986,496      : 2,385,424,666
    FP  : 32  : FMA       : 327,033,960        : 5,232,543,360      : 4,691,419,700
    INT : 64  : BASIC     : 3,309,568          : 26,476,544         : 26,476,544
    INT : 32  : FMA       : 3,309,568          : 26,476,544         : 26,476,544
    INT : 16  : STORE     : 1,654,784          : 26,476,544         : 26,476,544
============================================================
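The per-kernel metrics are printed as name and value pairs, so they are straightforward to load into a dictionary for further processing. The sketch below assumes the text format shown above and uses a hypothetical helper named parse_metrics. It also checks how several of the sample values relate to each other. These relations are observed in this sample output and are not taken from a documented definition of the metrics.

```python
import math

# A few lines copied from the sample output in this recipe.
SAMPLE_OUTPUT = """\
elapsed_time_________________________________: 0.0503165
gpu_compute_performance_fp_ai________________: 852.558
gpu_compute_performance_gflop________________: 11.7716
gpu_compute_performance_gflops_______________: 233.951
gpu_memory_bandwidth_gb_sec__________________: 0.27441
gpu_memory_data_transferred_gb_______________: 0.0138074
"""


def parse_metrics(text):
    """Parse 'name_____: value' lines into a dict of strings."""
    metrics = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a metric line
        metrics[key.strip().rstrip("_")] = value.strip()
    return metrics


metrics = parse_metrics(SAMPLE_OUTPUT)
elapsed = float(metrics["elapsed_time"])
gflop = float(metrics["gpu_compute_performance_gflop"])
transferred_gb = float(metrics["gpu_memory_data_transferred_gb"])

# Derived values, compared below with the reported ones.
derived_gflops = gflop / elapsed              # operations per second
derived_fp_ai = gflop / transferred_gb        # FLOP per byte of GPU memory traffic
derived_bandwidth = transferred_gb / elapsed  # GB/s

print(f"GFLOPS: derived {derived_gflops:.3f}, reported {metrics['gpu_compute_performance_gflops']}")
print(f"FP AI: derived {derived_fp_ai:.3f}, reported {metrics['gpu_compute_performance_fp_ai']}")
print(f"Bandwidth: derived {derived_bandwidth:.5f}, reported {metrics['gpu_memory_bandwidth_gb_sec']}")
```

For this kernel, the reported 233.951 GFLOPS is roughly 27% of the 872 GFLOPS SP Vector FMA peak printed by gpu_roofs.py, and its floating-point arithmetic intensity is far above the DRAM ridge point, which suggests the kernel is limited by compute rather than by memory bandwidth.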

Alternative Steps

Run GPU Roofline Insights Perspective on a Multi-GPU System (Instead of Default Configuration)

If your system has more than one GPU device (for example, an integrated GPU and a discrete GPU, or several discrete GPU devices), specify a target GPU to collect profiling data:

  1. Get the list of GPU devices available on your system:
    advisor --help target-gpu

    The output shows the device configuration in the following format: <domain>:<bus>:<device-number>.<function-number>.

  2. Copy the device configuration that you want to analyze. For example, 0:0:2.0 .
  3. Enter the following command in the terminal, passing the copied device configuration as the argument of the --target-gpu option to select the GPU of interest:
    advisor --collect=roofline --profile-gpu --target-gpu=0:0:2.0 --project-dir=./adv_gpu_roofline -- ./src/mandelbrot
  4. Run the command.
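When the device configuration is copied by hand or passed through a script, a quick format check can catch typos before a collection starts. The following is a minimal sketch with hypothetical helper names (parse_target_gpu, target_gpu_option). It assumes that each field consists of hexadecimal digits, as in PCI addresses; the text above only specifies the <domain>:<bus>:<device-number>.<function-number> layout.

```python
import re

# Assumption: every field is written with hexadecimal digits.
DEVICE_PATTERN = re.compile(
    r"^([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)\.([0-9a-fA-F]+)$"
)


def parse_target_gpu(config):
    """Split '<domain>:<bus>:<device-number>.<function-number>' into its fields."""
    match = DEVICE_PATTERN.match(config.strip())
    if match is None:
        raise ValueError(f"unexpected device configuration: {config!r}")
    domain, bus, device, function = match.groups()
    return {"domain": domain, "bus": bus, "device": device, "function": function}


def target_gpu_option(config):
    """Return the --target-gpu option for a validated device configuration."""
    parse_target_gpu(config)  # raises ValueError on a malformed string
    return f"--target-gpu={config.strip()}"


print(parse_target_gpu("0:0:2.0"))
print(target_gpu_option("0:0:2.0"))
```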

Run GPU Roofline Insights Perspective for Kernels with Small Execution Time (Instead of Default Configuration)

Accuracy decreases as the ratio of kernel time to sampling interval decreases. To achieve the best accuracy, ensure that

To avoid inaccurate metrics for kernels with small execution time, use the --gpu-sampling-interval=<double> option to decrease the interval (in milliseconds) between GPU samples:

advisor --collect=roofline --profile-gpu --gpu-sampling-interval=0.1 --project-dir=./adv_gpu_roofline -- ./src/mandelbrot