Profiling an OpenMP* Offload Application running on a GPU (NEW)
This recipe illustrates how you can build and compile an OpenMP* application offloaded onto an Intel GPU. The recipe also describes how to use Intel® VTune™ Profiler to run analyses with GPU capabilities (HPC Performance Characterization, GPU Offload, and GPU Compute/Media Hotspots) on the OpenMP application and examine results.
Content expert: Sunny Gogar and Nikita Kiryuhin
Ingredients
Here are the minimum hardware and software requirements for this performance analysis.
- Application: iso3dfd_omp_offload OpenMP Offload sample. This sample application is available as part of the code sample package for Intel® oneAPI toolkits.
- Compiler: To build and profile the OpenMP Offload application, you need the Intel® oneAPI DPC++/C++ Compiler (icx/icpx) that is available with the Intel® oneAPI Base Toolkit.
- Tools: Intel® VTune™ Profiler
  - HPC Performance Characterization analysis
  - GPU Offload analysis
  - GPU Compute/Media Hotspots analysis
- Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
- Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
- Get the latest version of Intel® VTune™ Profiler:
  - From the Intel® VTune™ Profiler product page.
  - Download the latest standalone package from the Intel® oneAPI standalone components page.
- Microarchitecture:
  - Intel Processor Graphics Gen 9
- Operating system:
  - Linux* OS, kernel version 4.14 or newer
  - Windows* 10 OS
- System Configuration:
  - Linux* OS: Follow instructions in Configure Your CPU or GPU System (Linux)
  - Windows*: Follow instructions in Configure Your CPU or GPU System (Windows)
Build and Compile the OpenMP Offload Application
On Linux OS:
- Set oneAPI environment variables. Source the setvars.sh script, which you can find here: /opt/intel/oneapi/setvars.sh
- Go to the sample directory: cd <sample_dir>/DirectProgramming/C++/StructuredGrids/iso3dfd_omp_offload
- Compile the OpenMP Offload application:
    mkdir build; cd build;
    cmake -DVERIFY_RESULTS=0 -DCMAKE_CXX_FLAGS="-g -mllvm -parallel-source-info=2" ..
    make -j
  This generates an src/iso3dfd executable. To delete the program, type:
    make clean
  This removes the executable and object files that were created by the make command.
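The Linux build steps above can be collected into a small script. The following is a minimal sketch, not part of the sample: the `require_tool` and `build_sample` helpers and the `DRY_RUN` switch are hypothetical names introduced here, and the script assumes that setvars.sh has already been sourced so that `icpx`, `cmake`, and `make` are on the PATH. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the Linux build steps from this recipe.
# require_tool, build_sample, and DRY_RUN are hypothetical, not part of the sample.

# Return non-zero (and explain why) if a required tool is not on PATH.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing tool: $1 (did you source setvars.sh?)" >&2
    return 1
  }
}

# Print the build commands; execute them only when DRY_RUN=0.
build_sample() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo 'mkdir -p build && cd build'
    echo 'cmake -DVERIFY_RESULTS=0 -DCMAKE_CXX_FLAGS="-g -mllvm -parallel-source-info=2" ..'
    echo 'make -j'
    return 0
  fi
  require_tool icpx && require_tool cmake && require_tool make || return 1
  mkdir -p build && cd build || return 1
  cmake -DVERIFY_RESULTS=0 -DCMAKE_CXX_FLAGS="-g -mllvm -parallel-source-info=2" .. &&
    make -j
}

build_sample
```

Run it from the sample directory with `DRY_RUN=0` to perform the build. The `-g` and `-parallel-source-info=2` flags are the ones this recipe passes so that offload regions carry source locations.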
On Windows OS:
- Set oneAPI environment variables. Run setvars.bat. You can find this script here: C:\Program Files (x86)\Intel\oneAPI\setvars.bat
- Open the sample directory: cd <sample_dir>/DirectProgramming/C++/StructuredGrids/iso3dfd_omp_offload
- Compile the OpenMP Offload application:
    mkdir build
    cd build
    icx /Zi -mllvm -parallel-source-info=2 /std:c++17 /EHsc /Qiopenmp /I../include\ /Qopenmp-targets:spir64 /DUSE_BASELINE /DEBUG ..\src\iso3dfd.cpp ..\src\iso3dfd_verify.cpp ..\src\utils.cpp
Run HPC Performance Characterization Analysis on the OpenMP Offload Application
To get a high-level summary of the performance of the OpenMP Offload application, run the HPC Performance Characterization analysis. This analysis type can help you understand how your application utilizes the CPU, GPU, and available memory. You can also see the extent to which your code is vectorized.
For OpenMP offload applications, the HPC Performance Characterization analysis shows you the hardware metrics associated with each of your OpenMP offload regions.
Prerequisites: Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
- Open VTune Profiler and click New Project to create a project.
- On the welcome page, click Configure Analysis to set up your analysis.
- Select these settings for your analysis.
  - In the WHERE pane, select Local Host.
  - In the WHAT pane, select Launch Application and specify the iso3dfd_omp_offload binary as the application to profile.
  - In the HOW pane, select the HPC Performance Characterization analysis type from the Parallelism group in the Analysis Tree.
- Click the Start button to run the analysis.
Run Analysis from Command Line:
To run the HPC Performance Characterization analysis from the command line:
- On Linux OS:
  - Set VTune Profiler environment variables by sourcing the script: source <install_dir>/vtune-vars.sh
  - Run the HPC Performance Characterization analysis: vtune -collect hpc-performance -- src/iso3dfd 256 256 256 16 8 64 100
- On Windows OS:
  - Set VTune Profiler environment variables by running the batch file: <install_dir>\vtune-vars.bat
  - Run the HPC Performance Characterization analysis: vtune -collect hpc-performance -- iso3dfd.exe 256 256 256 16 8 64 100
Analyze HPC Performance Characterization Data
Start your analysis by examining the Summary pane. Look at the Effective Physical Core Utilization (or Effective Logical Core Utilization) and GPU Utilization when Busy sections to see highlighted issues, if any.

In the GPU Utilization when Busy section, look at the top OpenMP offload regions sorted by offload time spent in those regions. You can see GPU utilization in each of these offload regions.

If you compiled your application with the full set of debug information, the names of the regions will contain their source locations. This includes:
- Name of the function
- Name of the source file
- Line number
Generate a Summary Report From the Command Line
To generate a summary report from the command line, type:
vtune -report summary -r <result>
In this example, the offload activity is classified almost entirely as Compute activity. Also, a single offload region consumed the majority of offload time. Click on its name to switch to the Bottom-Up view. Examine the grouping table with OpenMP offload region durations, region instance counts, and metrics for GPU and CPU.

Hover over the region markers at the top of the timeline view. You can see the name and duration of each offload region and offload operation within that region. The GPU metrics in the timeline help you understand how every instance of an offload region behaves over time.
Generate a Hotspots Report Grouped by Offload Region From the Command Line
To generate a Hotspots report (grouped by offload region) from the command line, type:
vtune -report hotspots -group-by=offload-region -r <result>
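The two report commands in this section can be generated for one result directory in a single pass. This is a minimal sketch under stated assumptions: `vtune_reports`, `RESULT_DIR`, `DRY_RUN`, and the directory name `r_hpc` are hypothetical names used only for illustration, and the report options are the ones shown above. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: produce the two text reports used in this section for one result.
# RESULT_DIR, vtune_reports, DRY_RUN, and "r_hpc" are illustrative names.

RESULT_DIR="${RESULT_DIR:-r_hpc}"

vtune_reports() {
  # $1 = result directory produced by an earlier collection
  for opts in "summary" "hotspots -group-by=offload-region"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: show the command without requiring VTune Profiler.
      echo "vtune -report $opts -r $1"
    else
      # $opts is unquoted on purpose so that it splits into separate options.
      vtune -report $opts -r "$1" || return 1
    fi
  done
}

vtune_reports "$RESULT_DIR"
```

With `DRY_RUN=0`, redirect the output to a file (for example, `vtune_reports r_hpc > reports.txt`) to keep the reports alongside the result.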

These details establish clearly that GPU activity played an important role in the performance of this application. Next, let us move to the GPU Offload Analysis to learn more.
Run GPU Offload Analysis on the OpenMP Offload Application
Prerequisites: If you have not already done so, prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
- In the Analysis Tree, select the GPU Offload analysis type from the Accelerators group.
- Select these settings for your analysis:
- Click the Start button to run the analysis.
Run Analysis from Command Line:
To run the GPU Offload analysis from the command line:
- On Linux OS, type: vtune -collect gpu-offload -- src/iso3dfd 256 256 256 16 8 64 100
- On Windows OS, type: vtune -collect gpu-offload -- iso3dfd.exe 256 256 256 16 8 64 100
Analyze GPU Offload Analysis Data
Start your analysis with the GPU Offload viewpoint.
In the Summary window, see statistics on CPU and GPU resource usage. Use this data to determine if your application is:
- GPU-bound
- CPU-bound
- Utilizing the compute resources of your system inefficiently
In this example, the application should use the GPU for intensive computation. However, the result summary shows that GPU usage is actually low.
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.

Switch to the Platform window. Here, you can see basic CPU and GPU metrics that help analyze GPU usage on a software queue. This data is correlated with CPU usage on the timeline.

The information in the Platform window can help you make some inferences.
| GPU Bound Applications | CPU Bound Applications |
|---|---|
| The GPU is busy for a majority of the profiling time. | The CPU is busy for a majority of the profiling time. |
| There are small idle gaps between busy intervals. | There are large idle gaps between busy intervals. |
| The GPU software queue is rarely reduced to zero. | |
Most applications do not match either profile this cleanly, so a detailed analysis is important to understand all dependencies. For example, GPU engines that are responsible for video processing and rendering may be loaded in turns, which means they are used serially. When the application code runs on the CPU, this can cause ineffective scheduling on the GPU, and the resulting behavior can mislead you into interpreting the application as GPU bound.
Identify the GPU execution phase based on the computing task reference and GPU Utilization metrics. Then, you can define the overhead for creating the task and placing it into a queue.
To investigate a computing task, switch to the Graphics window to examine the type of work (rendering or computation) running on the GPU per thread. Select the Computing Task grouping and use the table to study the performance characterization of your task.

Generate a Hotspots Report Grouped by Computing Task From the Command Line
To generate a Hotspots report (grouped by computing task) from the command line, type:
vtune -report hotspots -group-by=computing-task -r <result>
Use the README file in the sample to profile other implementations of the iso3dfd_omp_offload code.
In the next section, continue your investigation with the GPU Compute/Media Hotspots analysis.
Run GPU Compute/Media Hotspots Analysis on the OpenMP Offload Application
Prerequisites: If you have not already done so, prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
To run the analysis:
- In the Accelerators group, select the GPU Compute/Media Hotspots analysis type.
- Configure analysis options as described in the previous section.
- Click the Start button to run the analysis.
Run Analysis from Command Line
To run the analysis from the command line:
- On Linux OS: vtune -c gpu-hotspots -knob profiling-mode=source-analysis -- src/iso3dfd 256 256 256 16 8 64 100
- On Windows OS: vtune -collect gpu-hotspots -- iso3dfd.exe 256 256 256 16 8 64 100
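Because this recipe runs three analysis types against the same binary, the Linux collections can be scripted in one pass. The sketch below assumes the build layout from earlier in the recipe (`src/iso3dfd`); the `collect_all` helper, the `r_<analysis>` result directory naming, and the `DRY_RUN` switch are hypothetical and chosen only for illustration. Note the double hyphen (`--`) that separates VTune Profiler options from the application command line.

```shell
#!/bin/sh
# Sketch: collect the three analyses from this recipe into separate result
# directories. collect_all, the r_<analysis> naming, and DRY_RUN are
# illustrative only.

APP="src/iso3dfd"
APP_ARGS="256 256 256 16 8 64 100"

collect_all() {
  for analysis in hpc-performance gpu-offload gpu-hotspots; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: show the command without requiring VTune Profiler.
      echo "vtune -collect $analysis -r r_$analysis -- $APP $APP_ARGS"
    else
      # $APP_ARGS is unquoted on purpose so it splits into seven arguments.
      vtune -collect "$analysis" -r "r_$analysis" -- "$APP" $APP_ARGS || return 1
    fi
  done
}

collect_all
```

If a result directory with the same name already exists, the collection may fail, so remove or rename old results before rerunning with `DRY_RUN=0`.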
Analyze Your Compute Task
The default analysis configuration invokes the Characterization profile with the Overview metric set. In addition to individual compute task characterization that is available through the GPU Offload analysis, VTune Profiler provides memory bandwidth metrics that are categorized by different levels of GPU memory hierarchy.

For a visual representation of the memory hierarchy, see the Memory Hierarchy Diagram. This diagram reflects the microarchitecture of the current GPU and shows memory bandwidth metrics. Use the diagram to understand the data traffic between memory units and execution units. You can also identify potential bottlenecks that cause EU stalls.

You can also analyze compute tasks at the source code level. For example, you can count GPU clock cycles spent on a particular task or due to memory latency. Use the Source Analysis option for this purpose.

Run Memory Latency Source Analysis from Command Line
To run the analysis with the Memory Latency Source Analysis option from the command line:
- On Linux OS: vtune -c gpu-hotspots -knob profiling-mode=source-analysis -knob source-analysis=mem-latency -r iso_ghs_src-analysis_mem -- src/iso3dfd 256 256 256 16 8 64 100
In the source view, examine the Average Latency Cycles for the offload kernel.

Generate a Hotspots Report With Source for a Computing Task From the Command Line
To generate a Hotspots report (with source for a computing task) from the command line, type:
vtune -report hotspots -source-object 'computing-task=Iso3dfdIteration$omp$offloading:50' -group-by=gpu-source-line -r <result>
Run Basic Block Latency Source Analysis from Command Line
To run the analysis with the Basic Blocks Latency Source Analysis option from the command line:
- On Linux OS: vtune -c gpu-hotspots -knob profiling-mode=source-analysis -r iso_ghs_src-analysis -- src/iso3dfd 256 256 256 16 8 64 100
In the source view, examine the Average Latency Cycles for the offload kernel.
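The two source analysis collections differ only in their knobs and result directories. As a rough sketch, the mapping can be written down as a small helper that prints the matching Linux command. The function name `source_analysis_cmd` and the mode labels `basic-block` and `mem-latency` used as its argument are hypothetical; the knobs and result directory names are the ones used in the commands above.

```shell
#!/bin/sh
# Sketch: map a source-analysis mode to the knobs and result directory used
# in this recipe. source_analysis_cmd and its mode labels are illustrative;
# the knobs and directory names come from the commands in this section.

APP="src/iso3dfd"
APP_ARGS="256 256 256 16 8 64 100"

source_analysis_cmd() {
  case "$1" in
    basic-block)
      knobs="-knob profiling-mode=source-analysis"
      result="iso_ghs_src-analysis" ;;
    mem-latency)
      knobs="-knob profiling-mode=source-analysis -knob source-analysis=mem-latency"
      result="iso_ghs_src-analysis_mem" ;;
    *)
      echo "unknown mode: $1" >&2
      return 1 ;;
  esac
  # Only print the command; the caller decides whether to run it.
  echo "vtune -c gpu-hotspots $knobs -r $result -- $APP $APP_ARGS"
}

for mode in basic-block mem-latency; do
  source_analysis_cmd "$mode"
done
```

Keeping the two runs in separate result directories lets you open each result on its own and compare the Average Latency Cycles values for the same kernel.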

Discuss this recipe in the VTune Profiler developer forum.