Using the Command-Line Interface to Analyze the Performance of a SYCL* Application Running on a GPU
This recipe illustrates how you use the command-line interface (CLI) in Intel® VTune™ Profiler to analyze the performance of a SYCL application offloaded on an Intel GPU. The recipe also describes how you can customize your report with collected data.
Intel® VTune™ Profiler provides a command-line interface (the vtune tool) for remote analysis, scripted commands, and performance regression checks to monitor software performance over time. The vtune CLI provides an extensive set of options with which you can perform almost every task that is possible through the GUI. You can initiate analysis via the command line (running it as a background task or on a remote system) and then view the result or generate a report.
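As an illustration of the regression checks mentioned above, here is a minimal shell sketch that compares a current elapsed time against a baseline. The function name `is_regression`, the 5% tolerance, and the sample numbers are assumptions made for this example. How you obtain the two time values (for example, by reading them from two summary reports) is not shown.

```shell
#!/bin/sh
# Sketch only: flag a run whose elapsed time exceeds a baseline by more
# than an allowed percentage. The caller supplies both times in seconds.

# is_regression BASELINE_SEC CURRENT_SEC ALLOWED_PCT
# Returns 0 (true) when CURRENT exceeds BASELINE by more than ALLOWED_PCT.
is_regression() {
    awk -v base="$1" -v cur="$2" -v pct="$3" \
        'BEGIN { exit !(cur > base * (1 + pct / 100)) }'
}

# Example with made-up numbers: 12.9 s against a 12.0 s baseline, 5% allowed.
if is_regression 12.0 12.9 5; then
    echo "regression: elapsed time grew by more than 5%"
else
    echo "ok: elapsed time within 5% of baseline"
fi
```

A script like this can return a non-zero exit status to a CI job when the check fails. The threshold depends on how noisy your measurements are.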
This recipe explores how you can use the CLI efficiently to generate reports on hotspots for these purposes:
- Explore hotspots on the CPU/GPU side by running gpu-offload and gpu-hotspots analyses.
- View the hottest GPU computing tasks annotated with:
- Execution time
- Data transfers
- Working group sizes
- SIMD width
- Average GPU hardware metrics
- Generate Source/Assembly code views to analyze instructions that possibly contributed to performance issues.
Here are the ingredients and instructions you need to explore efficient CLI use for GPU performance analysis.
Ingredients
Here are the minimum hardware and software requirements for this performance analysis.
- Application: the matrix multiplication sample used in this recipe. This sample application is available as part of the code sample package for Intel® oneAPI toolkits.
- Compiler: To compile a SYCL application, you need the Intel® oneAPI DPC++/C++ Compiler (icpx -fsycl) that is available with the Intel® oneAPI Base Toolkit.
- Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
- Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
- Get the latest version of Intel® VTune™ Profiler:
- From the Intel® VTune™ Profiler product page.
- Download the latest standalone package from the Intel® oneAPI standalone components page.
- Microarchitecture:
- Intel® Iris® Pro Graphics 580
- Intel microarchitecture codenamed Skylake S
- Operating system:
- Ubuntu 20.04 LTS
Build and Compile the SYCL Application
- Go to the sample directory:
cd <sample_dir>/VtuneProfiler/matrix_multiply_vtune
- The multiply.cpp file in the src directory contains several versions of matrix multiplication. Select a version by editing the corresponding #define MULTIPLY line in multiply.hpp.
- Compile your sample application:
cmake . && make
This command generates a matrix.icpx -fsycl executable.
To delete the program, type:
make clean
This command removes the executable and object files that were created by the make command.
Ensure Prerequisites for GPU Analyses
Complete these steps before you run the GPU Offload Analysis or the GPU Compute/Media Hotspots Analysis.
- Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
- Set up environment variables for Intel software tools:
source $ONEAPI_ROOT/setvars.sh
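Before you start a collection from a script, it can help to confirm that the environment setup took effect. The sketch below checks whether a tool is on PATH. The helper name `require_tool` and its messages are assumptions made for this example and are not part of VTune Profiler.

```shell
#!/bin/sh
# Sketch only: verify that a command is available before using it.

# require_tool NAME
# Returns 0 if NAME is found on PATH. Otherwise prints a hint to stderr
# and returns 1.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found on PATH; source setvars.sh first" >&2
    return 1
}

# Typical use after sourcing setvars.sh:
require_tool vtune || echo "skipping analysis: environment is not set up"
```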
Run GPU Offload Analysis on the SYCL Application
Use the GPU Offload Analysis as a starting point to identify whether an application is CPU bound or GPU bound. Explore GPU offload efficiency through data transfer analysis and find performance-critical kernels for further analysis and optimization.
Run GPU Offload Analysis
In the CLI, type:
vtune -collect gpu-offload -r ./result_gpu-offload -- ./matrix.icpx -fsycl
By default, VTune Profiler generates a summary report after collecting data. This report includes information on the following fields:
- Elapsed time
- GPU utilization information
- Information about the hottest computing tasks
- Recommendations
To see the summary report, type:
vtune -report summary -r ./result_gpu-offload
If you do not need to see the summary report immediately after data collection, change this setting with the -no-summary option:
vtune -collect gpu-offload -no-summary -r ./result_gpu-offload -- ./matrix.icpx -fsycl
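When you script repeated collections, one way to keep the results of separate runs apart is to give each run its own result directory. The sketch below composes the collection command shown above with a timestamped directory name. The helper names `result_dir_name` and `collect_cmd` and the timestamp format are assumptions made for this example. The script prints the command rather than running it, so that you can review it first.

```shell
#!/bin/sh
# Sketch only: build a per-run result directory name and print the
# matching collection command from this recipe.

# result_dir_name ANALYSIS [STAMP]
# Prints a name such as ./result_gpu-offload_20240101-120000.
# STAMP defaults to the current date and time.
result_dir_name() {
    stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
    printf './result_%s_%s\n' "$1" "$stamp"
}

# collect_cmd ANALYSIS RESULT_DIR APP [ARGS...]
# Prints (does not run) the collection command with -no-summary.
collect_cmd() {
    analysis="$1"; result_dir="$2"; shift 2
    printf 'vtune -collect %s -no-summary -r %s -- %s\n' \
        "$analysis" "$result_dir" "$*"
}

collect_cmd gpu-offload "$(result_dir_name gpu-offload)" ./matrix.icpx -fsycl
```

To run the collection, pipe the printed line to sh after you have checked it.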

Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.
Generate Additional Reports to View Collected Data
- CPU Hotspots Report
This report displays a list of executed functions with CPU Time metrics, module names, source file paths, and other parameters. The report also lists the hottest program units, starting with the most performance-critical unit. Use the -column, -filter, and -limit options to sort data into a tabular view:
vtune -report hotspots -r ./result_gpu-offload
- CPU Hotspots Report Filtered by Module and Grouped by Function
vtune -report hotspots -r ./result_gpu-offload -group-by=function -filter module=matrix.icpx -fsycl -q
You can group the generated data in several ways, such as by function name, module, source file path, or computing task.
To see available groupings for a specific result, type:
vtune -report hotspots -r ./result_gpu-offload -group-by=?
- CPU Hotspots Report Sorted by Order
vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time:Execution" -q
Here is another example:
vtune -report hotspots -r result_gpu-offload -group-by module -sort-asc="CPU Time:Idle" -q
To see available columns for a specific result, type:
vtune -report hotspots -r ./result_gpu-offload -column=?
The report data can contain columns such as CPU Time:Self, Module, and Source File.
- Report of Top 'n' Time-Intensive Program Modules
Use the -limit option to see information about the top 'n' hotspots. For example, to understand details about the top five time-intensive program modules in your application, type:
vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time" -limit=5 -q
- Hotspots Report Grouped by Computing Task (offloaded on GPU) with Transfer Columns
This command displays hotspots information grouped by GPU computing task and also lists details about transfer sizes and transfer times between CPU and GPU:
vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task -column=Transfer -q
The report contains data transfers that are attributed to the respective computing task.
- Hotspots Report Grouped by GPU Offload Computing Task and Time Columns
This command displays hotspots information grouped by offload computing tasks and also lists details about transfer times between CPU and GPU:
vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task-offload -column='Time' -q
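The report commands above differ mainly in their grouping, so a script can generate several of them in one pass. The sketch below prints one hotspots report command per grouping and sends each report to its own file with -report-output. The helper name `report_cmds` and the `hotspots_<grouping>.txt` file naming are assumptions made for this example. The groupings are the ones named in this recipe.

```shell
#!/bin/sh
# Sketch only: print one hotspots report command per grouping.

# report_cmds RESULT_DIR GROUPING...
# Prints (does not run) a report command for each grouping, writing
# each report to hotspots_<grouping>.txt.
report_cmds() {
    result_dir="$1"; shift
    for grouping in "$@"; do
        printf 'vtune -report hotspots -r %s -group-by=%s -report-output=hotspots_%s.txt -q\n' \
            "$result_dir" "$grouping" "$grouping"
    done
}

report_cmds ./result_gpu-offload function module computing-task
```

Pipe the output to sh to generate the reports once the printed commands look right for your result directory.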
Run GPU Compute/Media Hotspots Analysis
Our next step is to run the GPU Compute/Media Hotspots analysis. This analysis can help us to further explore performance improvements for the GPU-bound application or its stages.
Run GPU Compute/Media Hotspots Analysis
In the CLI, type this command to run the analysis:
vtune -collect gpu-hotspots -r ./result_gpu-hotspots -- ./matrix.icpx -fsycl
To see the summary report, type:
vtune -report summary -r ./result_gpu-hotspots

Generate Report to View Computing Tasks with L3 Metrics
Use this command to generate a report that lists only L3 metrics for computing tasks:
vtune -report hotspots -r result_gpu-hotspots -group-by=computing-task -column='L3' -q

Run GPU Compute/Media Hotspots Analysis with Dynamic Instruction Count and SIMD Utilization
Run the GPU Compute/Media Hotspots Analysis in the Characterization mode to collect data on dynamic instruction count and SIMD utilization:
vtune -collect gpu-hotspots -knob characterization-mode=instruction-count -r ./result_gpu-hotspots_inst-count -- ./matrix.icpx -fsycl
Generate Reports to View Source and Assembly Metrics
- Source Code for Specific Computing Tasks
Use this command to get the source code for a specific computing task:
vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=gpu-source-line -column="Source","GPU Instructions Executed:Int32 & SP Float" -q
- Assembly Code for Specific Computing Tasks
Use this command to get the assembly code for a specific computing task:
vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -q
- Save Report as CSV File
Use the -report-output option to save the generated report as a file. To specify the generation of a .csv report, use the -format=csv and -csv-delimiter=comma options:
vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -report-output=result.csv -format=csv -csv-delimiter=comma -q
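Once a report is saved as a comma-delimited file, a script can pull out individual columns for further processing. The sketch below extracts one column by its header name with awk. The helper name `csv_column` and the column names in the sample data are hypothetical. The actual headers depend on the report you generate. This simple parser splits on every comma, so it does not handle quoted fields that contain commas.

```shell
#!/bin/sh
# Sketch only: print the values of one named column from a
# comma-delimited file whose first line is a header row.

# csv_column FILE HEADER
# Prints nothing if HEADER is not present in the first line.
csv_column() {
    awk -F',' -v want="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i; next }
        col     { print $col }
    ' "$1"
}

# Demonstration on made-up data (hypothetical column names).
tmp="$(mktemp)"
printf 'Address,Count\n0x10,5\n0x20,7\n' > "$tmp"
csv_column "$tmp" Count
rm -f "$tmp"
```

For reports whose fields can contain commas, choosing a different delimiter or using a full CSV parser is the safer option.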
Run Custom Analysis with GPU Programming API Statistics
To get a focused analysis of timing and statistics related to GPU compute kernels, follow the GPU Compute/Media Hotspots analysis with a custom analysis that collects GPU Programming API statistics.
The kernel data available through this collection is similar to the data you collect when running the CLIntercept tool (with the DevicePerformanceTiming option enabled) and with the nvprof tool in Summary mode.
Collect GPU Programming API Statistics
In the command line, type:
vtune -collect-with runss -knob collect-programming-api=true -no-summary -r ./result_gpu-programming-api -- ./matrix.icpx -fsycl
Generate Report to View Timing and Statistics for GPU Compute Kernels
This command generates a report that lists timings and instance count for computing tasks. The data is sorted by Total Time in descending order.
vtune -report hotspots -group-by=source-computing-task -column="Total Time,Average Time,Instance Count" -sort-desc="Total Time" -r ./result_gpu-programming-api/ -q

Discuss this recipe in the VTune Profiler developer forum.