Cookbook

  • 2021
  • 11/09/2021

Using the Command-Line Interface to Analyze the Performance of a DPC++ Application Running on a GPU

This recipe illustrates how to use the command-line interface (CLI) in Intel® VTune™ Profiler to analyze the performance of a Data Parallel C++ (DPC++) application offloaded to an Intel GPU. The recipe also describes how you can customize your report with collected data.
Content experts: Egor Suldin, Mariya Petrova
Intel® VTune™ Profiler provides a command-line interface (the vtune tool) for remote analysis, scripted commands, and performance regression checks to monitor software performance over time. The vtune CLI provides an extensive set of options with which you can perform almost every task that is possible through the GUI. You can initiate analysis via the command line (running it as a background task or on a remote system) and then view the result or generate a report.
This recipe explores how you can use the CLI efficiently to generate reports on hotspots for these purposes:
  • Explore hotspots on the CPU/GPU side by running gpu-offload and gpu-hotspots analyses.
  • View the hottest GPU computing tasks annotated with:
    • Execution time
    • Data transfers
    • Working group sizes
    • SIMD width
    • Average GPU hardware metrics
  • Generate Source/Assembly code views to analyze instructions that possibly contributed to performance issues.
Here are the ingredients and instructions you need to explore efficient CLI use for GPU performance analysis.

Ingredients

Here are the minimum hardware and software requirements for this performance analysis.

Build and Compile the DPC++ Application

  1. Go to the sample directory.
    cd <sample_dir>/VtuneProfiler/matrix_multiply_vtune
  2. The multiply.cpp file in the src directory contains several DPC++ versions of matrix multiplication. Select a version by editing the corresponding #define MULTIPLY line in multiply.hpp.
  3. Compile your sample DPC++ application:
    cmake . && make
    This command generates a matrix.dpcpp executable.
    To delete the program, type:
    make clean
    This command removes the executable and object files that were created by the make command.

Ensure Prerequisites for GPU Analyses

Complete these steps before you run the GPU Offload Analysis or the GPU Compute/Media Hotspots Analysis.
  1. Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
  2. Set up environment variables for Intel software tools:
    source $ONEAPI_ROOT/setvars.sh
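
After sourcing setvars.sh, it can help to confirm that the tools are actually reachable before starting a long collection. The following is a minimal sketch of such a check; require_tool is a hypothetical helper name used only for this example and is not part of VTune Profiler.

```shell
#!/bin/sh
# Sketch: verify that a command is on PATH before using it.
# require_tool is a hypothetical helper name, not a VTune Profiler feature.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH; did you source setvars.sh?" >&2
    return 1
}

# Example use (commented out so that sourcing this block has no side effects):
# require_tool vtune || exit 1
```

The helper only checks that the command can be found. It does not verify GPU driver setup or permissions, which the system setup step covers.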

Run GPU Offload Analysis on the DPC++ Application

Use the GPU Offload Analysis as a starting point to identify whether an application is CPU-bound or GPU-bound. Explore GPU offload efficiency through data transfer analysis and find performance-critical kernels for further analysis and optimization.
Run GPU Offload Analysis
In the CLI, type:
vtune -collect gpu-offload -r ./result_gpu-offload -- ./matrix.dpcpp
By default, VTune Profiler generates a summary report after collecting data. This report includes the following information:
  • Elapsed time
  • GPU utilization information
  • Information about the hottest computing tasks
  • Recommendations
To see the summary report, type:
vtune -report summary -r ./result_gpu-offload
If you do not need to see the summary report immediately after data collection, suppress it with the -no-summary option:
vtune -collect gpu-offload -no-summary -r ./result_gpu-offload -- ./matrix.dpcpp
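
Because the summary is plain text, it can feed a simple performance regression gate in a script. The sketch below assumes that the summary contains a line of the form "Elapsed Time: <seconds>s"; check the output of your VTune Profiler version and adjust the pattern if it differs. The function names elapsed_seconds and within_budget, and the 5.0-second budget in the usage comment, are hypothetical.

```shell
#!/bin/sh
# Sketch: read summary text on stdin and print the elapsed time in seconds.
# Assumes a line such as "Elapsed Time: 1.234s" (an assumption about the format).
elapsed_seconds() {
    sed -n 's/^[[:space:]]*Elapsed Time:[[:space:]]*\([0-9.][0-9.]*\)s.*$/\1/p' | head -n 1
}

# Succeeds when the measured time ($1) does not exceed the budget ($2), in seconds.
within_budget() {
    [ -n "$1" ] || return 2    # nothing was parsed from the summary
    awk -v t="$1" -v b="$2" 'BEGIN { exit !(t + 0 <= b + 0) }'
}

# Example use (commented out; it needs a collected result and vtune on PATH):
# t=$(vtune -report summary -r ./result_gpu-offload | elapsed_seconds)
# within_budget "$t" 5.0 || { echo "regression: elapsed ${t}s" >&2; exit 1; }
```

Keeping the parsing and the comparison in separate functions makes it easy to swap in a different metric line without touching the threshold logic.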
Generate Additional Reports to View Collected Data
  • CPU Hotspots Report
    This report displays a list of executed functions with CPU Time metrics, module names, source file paths and other parameters. The report also lists the hottest program units, starting with the most performance-critical unit. You can use the -column, -filter, and -limit options to tailor the tabular output. To generate the basic report, type:
    vtune -report hotspots -r ./result_gpu-offload
  • CPU Hotspots Report Filtered by Module and Grouped by Function
    Use the -filter option to focus on a specific part of the report, such as a particular module. You can then use the -group-by option to group the results by a specific category.
    vtune -report hotspots -r ./result_gpu-offload -group-by=function -filter module=matrix.dpcpp -q
    You can group the generated data in several ways, such as by function name, module, source file path, or computing task.
    To see available groupings for a specific result, type:
    vtune -report hotspots -r ./result_gpu-offload -group-by=?
  • CPU Hotspots Report Sorted by Order
    Use the -sort-asc and -sort-desc options to sort specific information about hotspots in ascending or descending order. You can specify an order for up to three columns.
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time:Execution" -q
    Here is another example:
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-asc="CPU Time:Idle" -q
    To see available columns for a specific result, type:
    vtune -report hotspots -r ./result_gpu-offload -column=?
    The report data can contain such columns as CPU Time:Self, Module, and Source File.
  • Report of Top 'n' Time-Intensive Program Modules
    Use the -limit option to see information about the top 'n' hotspots. For example, to understand details about the top five time-intensive program modules in your application, type:
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time" -limit=5 -q
  • Hotspots Report Grouped by Computing Task (offloaded on GPU) with Transfer Columns
    This command displays hotspots information grouped by GPU computing task and also lists details about transfer sizes and transfer times between CPU and GPU:
    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task -column=Transfer -q
    The report contains data transfers that are attributed to the respective computing task.
  • Hotspots Report Grouped by GPU Offload Computing Task and Time Columns
    This command displays hotspots information grouped by offload computing tasks and also lists details about transfer times between CPU and GPU:
    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task-offload -column='Time' -q
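
The reports above differ mainly in the -group-by value, so they lend themselves to a loop. The sketch below prints one report command per grouping and, when DRY_RUN is set to 0, runs each command and redirects its output to a text file. The names generate_reports and DRY_RUN and the hotspots_<grouping>.txt file names are hypothetical choices for this example.

```shell
#!/bin/sh
# Sketch: one hotspots report per grouping for a single result directory.
# generate_reports and DRY_RUN are hypothetical names used only here.
generate_reports() {
    result_dir=$1
    shift
    for grouping in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): show the command without executing it.
            echo "vtune -report hotspots -r $result_dir -group-by=$grouping -q"
        else
            # Real run: requires vtune on PATH and an existing result directory.
            vtune -report hotspots -r "$result_dir" -group-by="$grouping" -q \
                > "hotspots_${grouping}.txt"
        fi
    done
}

# Example: prints two commands because DRY_RUN defaults to 1.
# generate_reports ./result_gpu-offload function module
```

Use -group-by=? on your own result to confirm which groupings it supports before passing them to the loop.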

Run GPU Compute/Media Hotspots Analysis

Our next step is to run the GPU Compute/Media Hotspots analysis. This analysis can help us to further explore performance improvements for the GPU-bound application or its stages.
Run GPU Compute/Media Hotspots Analysis
In the CLI, type this command to run the analysis:
vtune -collect gpu-hotspots -r ./result_gpu-hotspots -- ./matrix.dpcpp
To see the summary report, type:
vtune -report summary -r ./result_gpu-hotspots
Generate Report to View Computing Tasks with L3 Metrics
Use this command to generate a report that lists only L3 metrics for computing tasks:
vtune -report hotspots -r result_gpu-hotspots -group-by=computing-task -column='L3' -q
Run GPU Compute/Media Hotspots Analysis with Dynamic Instruction Count and SIMD Utilization
Run the GPU Compute/Media Hotspots Analysis in the Characterization mode to collect data on dynamic instruction count and SIMD utilization:
vtune -collect gpu-hotspots -knob characterization-mode=instruction-count -r ./result_gpu-hotspots_inst-count -- ./matrix.dpcpp
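
When you repeat collections with different settings, a distinct result directory per run keeps the results apart. The following sketch builds a timestamped directory name; unique_result_dir is a hypothetical helper, and the naming scheme is only one possible convention.

```shell
#!/bin/sh
# Sketch: build a result directory name that is unlikely to collide with
# an earlier run. unique_result_dir is a hypothetical helper name.
unique_result_dir() {
    # $1 = label for the analysis, $2 = optional fixed timestamp
    stamp=${2:-$(date +%Y%m%d-%H%M%S)}
    echo "./result_${1}_${stamp}"
}

# Example use (commented out; it needs vtune on PATH and the built sample):
# vtune -collect gpu-hotspots -knob characterization-mode=instruction-count \
#     -r "$(unique_result_dir gpu-hotspots_inst-count)" -- ./matrix.dpcpp
```

Passing a fixed timestamp as the second argument gives reproducible names, which can be convenient in CI scripts.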
Generate Reports to View Source and Assembly Metrics
  • Source Code for Specific Computing Tasks
    Use this command to get the source code for a specific computing task:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=gpu-source-line -column="Source","GPU Instructions Executed:Int32 & SP Float" -q
  • Assembly Code for Specific Computing Tasks
    Use this command to get the assembly code for a specific computing task:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -q
  • Save Report as CSV File
    Use the -report-output option to save the generated report as a file. To generate a .csv report, use the -format=csv and -csv-delimiter options:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -report-output=result.csv -format=csv -csv-delimiter=comma -q
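
Once a report is saved as a comma-delimited file, standard text tools can post-process it. The sketch below prints a single column selected by its header name. It assumes that the first line of the file is a header row and it splits naively on commas, so it does not handle quoted fields that contain commas. The helper name csv_column and the "Address" column in the usage comment are illustrative assumptions; check the header of your own result.csv.

```shell
#!/bin/sh
# Sketch: print one named column from a comma-delimited report on stdin.
# Naive split on commas; quoted fields containing commas are not supported.
# csv_column is a hypothetical helper name.
csv_column() {
    awk -F',' -v name="$1" '
        NR == 1 {
            # Find the index of the requested header.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }
    '
}

# Example use (commented out; the column name is an assumption):
# csv_column "Address" < result.csv
```

For reports with quoted or multi-line fields, a real CSV parser is a safer choice than this awk one-liner.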

Run Custom Analysis with GPU Programming API Statistics

To get a focused analysis of timing and statistics related to GPU compute kernels, follow the GPU Compute/Media Hotspots analysis with a custom analysis that collects GPU Programming API statistics.
The kernel data available through this collection is similar to the data you collect when running the CLIntercept tool (with the DevicePerformanceTiming option enabled) and with the nvprof tool in Summary mode.
Collect GPU Programming API Statistics
In the command line, type:
vtune -collect-with runss -knob collect-programming-api=true -no-summary -r ./result_gpu-programming-api -- ./matrix.dpcpp
Generate Report to View Timing and Statistics for GPU Compute Kernels
This command generates a report that lists timings and instance count for computing tasks. The data is sorted by Total Time in descending order:
vtune -report hotspots -group-by=source-computing-task -column="Total Time,Average Time,Instance Count" -sort-desc="Total Time" -r ./result_gpu-programming-api/ -q
Discuss this recipe in the VTune Profiler developer forum.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.