GPU Compute/Media Hotspots Analysis (Preview)
Analyze the most time-consuming GPU kernels, characterize GPU usage based on GPU hardware metrics, identify performance issues caused by memory latency or inefficient kernel algorithms, and analyze GPU instruction frequency per certain instruction types.
This is a
PREVIEW FEATURE
. A preview feature may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases.
Use the GPU Compute/Media Hotspots analysis to:
- Explore GPU kernels with high GPU utilization, estimate the effectiveness of this utilization, identify possible reasons for stalls or low occupancy and options.
- Explore the performance of your application per selected GPU metrics over time.
- Analyze the hottest SYCL* standards or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item configuration.
The GPU Compute/Media Hotspots analysis is a good next step if you have already run the
GPU Offload analysis and identified:
- a performance-critical kernel for further analysis and optimization;
- a performance-critical kernel that it is tightly connected with other kernels in the program and may slow down their performance.

How It Works: Intel Graphics Render Engine and Hardware Metrics
A GPU is a highly parallel machine where graphical or computational work is done by an array of small cores, or execution units (EUs). Each EU simultaneously runs several lightweight threads. When one of these threads is picked up for an execution, it can hide stalls in the other threads if the other threads are stalled waiting for data from memory or other units.
To use a full potential of the GPU, applications should enable the scheduling of as many threads as possible and minimize idle cycles. Minimizing stalls is also very important for graphics and general purpose computing GPU applications.
VTune
can monitor Intel Graphics hardware events and display metrics about integral GPU resource usage over a sampled period, for example, ratio of cycles when EUs were idle, stalled, or active as well as statistics on memory accesses and other functional units. If the
Profiler
VTune
traces GPU kernel execution, it annotates each kernel with GPU metrics.
Profiler
The scheme below displays metrics collected by the
VTune
across different parts of the
Intel® Processor Graphics Gen9:
Profiler

GPU metrics help identify how efficiently GPU hardware resources are used and whether any performance improvements are possible. Many metrics are represented as a ratio of cycles when the GPU functional unit(s) is in a specific state over all the cycles available for a sampling period.
Configure the Analysis
- Make sure you set up the system and enable required permissions for GPU analysis.
- For SYCL applications: make sure to compile your code with the-gline-tables-onlyand-fdebug-info-for-profilingIntel oneAPI DPC++ Compiler options.
- Create a project and specify an analysis system and target.
Run the Analysis
- Click the
(standalone GUI)/
(Visual Studio IDE)
Configure Analysistoolbar button to open theConfigure Analysiswindow . - Click anywhere in the title bar of theHOWpane. Open the Analysis Tree and selectGPU Compute/Media Hotspots (Preview)analysis from theAcceleratorsgroup. This analysis is pre-configured to collect GPU usage data, analyze GPU task scheduling and identify whether your application is CPU or GPU bound.If you have multiple Intel GPUs connected to your system, run the analysis on the GPU of your choice or on all connected devices. For more information, see Analyze Multiple GPUs.
- Choose and configure one of these analysis modes:
- Optionally, narrow down the analysis to specific kernels you identified as performance-critical (stalled or time-consuming) in the GPU Offload analysis, and specify them asComputing tasks of interestto profile. If required, modify theInstance stepfor each kernel, which is a sampling interval (in the number of kernels). This option helps reduce profiling overhead.
- (Optional)To collect data on energy consumption, check theAnalyze power usageoption. This feature is available when you profile applications in a Linux environment and use an Intel® Iris® XeMAX graphics discrete GPU.
- ClickStartto run the analysis.
Run from Command Line
To run the GPU Compute/Media Hotspots analysis from the command line, type:
To
generate the command line for this configuration, use the
Command Line...
button at the bottom.
Analyze Multiple GPUs
If you connect multiple Intel GPUs to your system,
VTune
identifies all of these adapters in the
Profiler
Target GPU
pulldown menu. Follow these guidelines:
- Use theTarget GPUpulldown menu to specify the device you want to profile.
- TheTarget GPUpulldown menu displays only whenVTunedetects multiple GPUs running on the system. The menu then displays the name of each GPU with the bus/device/function (BDF) of its adapter. You can also find this information on your Windows (see Task Manager) or Linux (runProfilerlspci) system.
- If you do not select a GPU,VTuneselects the most recent device family in the list by default.Profiler
- SelectAll devicesto run the analysis on all of the GPUs connected to your system.
- Full compute set inCharacterizationmode is not available for multi-adapter/tile analysis.
Once the analysis completes,
VTune
displays summary results per GPU including tile information in the
Profiler
Summary
window.
Analysis Results
Once the GPU Compute/Media Hotspots Analysis completes data collection, the
Summary
window displays metrics that describe:
- GPU time
- Occupancy
- Peak occupancy you can expect with the existing computing task configuration
- The most active computing tasks that ran on the GPU

Families of Intel® X
e
graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe
Graphics.
Analysis Results for Multiple GPUs
When you profile an application running on multiple Intel® GPUs, the
Summary
window of the GPU Compute/Media Hotspots Analysis displays results by grouping GPUs of the same Intel microarchitecture. Each architecture group then contains metric information for that group.
For each architecture group, you can see the values for these metrics:
- Occupancy
- GPU L3 Bandwidth Bound
For every adapter in an architecture group, you can also see the values for the XVE Array Active/Stalled/Idle metrics.

Configure Characterization Analysis
Use the
Characterization
configuration option to:
- Monitor the Render and GPGPU engine usage (Intel Graphics only)
- Identify which parts of the engine are loaded
- Correlate GPU and CPU data
When you select the
Characterization
radio button, the configuration section expands with additional options:
- Overviewmetric set includes additional metrics that track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.
- Compute Basic (with global/local memory accesses)metric group includes additional metrics that distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
- Compute Extendedmetric group includes additional metrics targeted only for GPU analysis on the Intel processor code name Broadwell and higher. For other systems, this preset is not available.
- Full Computemetric group is a combination of theOverviewandCompute Basicevent sets.
- Dynamic Instruction Countmetric group counts the execution frequency of specific classes of instructions. With this metric group, you also get an insight into the efficiency of SIMD utilization by each kernel.
The
Characterization
drop-down menu provides platform-specific presets of the GPU metrics. All presets, except for the
Dynamic Instruction Count
, collect data about execution units (EUs) activity: EU Array Active, EU Array Stalled, EU Array Idle, Computing Threads Started, and Core Frequency; and each one introduces additional metrics:
You can run the GPU Compute/Media Hotspots analysis in Characterization mode for Windows*, Linux* and Android* targets. However, you must have root/administrative privileges to run the analysis in this mode.
For the Characterization analysis, you can also collect additional data:
- Use theTrace GPU programming APIsoption to analyze SYCL, OpenCL™, or Intel Media SDK programs running on Intel Processor Graphics. This option may affect the performance of your application on the CPU side.For SYCL or OpenCL applications, you may identify the hottest kernels and identify the GPU architecture block where a performance issue for a particular kernel was detected.For Intel Media SDK programs, you may explore the Intel Media SDK tasks execution on the timeline and correlate this data with the GPU usage at each moment of time.Support limitations:
- OpenCL kernels analysis is possible for Windows and Linux targets running on Intel Graphics.
- Intel Media SDK program analysis is possible for Windows and Linux targets running on Intel Graphics.
- OnlyLaunch ApplicationorAttach to Processtarget types are supported.
In theAttach to Processmode if you attached to a process when the computing queue is already created,VTunewill not display data for the OpenCL kernels in this queue.Profiler - Use theAnalyze memory bandwidthoption to collect the data required to compute memory bandwidth. This type of analysis requires Intel sampling drivers to be installed.
- Use theGPU sampling internal, msfield to specify an interval (in milliseconds) between GPU samples for GPU hardware metrics collection. By default, theVTuneuses 1ms interval.Profiler
Configure Source Analysis
In the Source Analysis,
VTune
helps you identify performance-critical basic blocks, issues caused by memory accesses in the GPU kernels.
Profiler
When you select the
Source Analysis
radio button, the configuration pane expands a drop-down menu where you can select a profiling mode to specify a type of issues you want to analyze:
- Basic Block Latencyoption helps you identify issues caused by algorithm inefficiencies. In this mode,VTunemeasures the execution time of all basic blocks. Basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end of this sequence. During post-processing,ProfilerVTunecalculates the execution time for each instruction in the basic block. So, this mode helps understand which operations are more expensive.Profiler
- Memory Latencyoption helps identify latency issues caused by memory accesses. In this mode,VTuneprofiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option, if you ran the GPU Compute/Media Hotspots analysis in the Characterization mode, identified that the GPU kernel is throughput or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take more time.Profiler
In the
Basic Block Latency
or
Memory Latency
profiling modes, the GPU Compute/Media Hotspots analysis uses these metrics:
- Estimated GPU Cycles: The average number of cycles spent by the GPU executing the profiled instructions.
- Average Latency: The average latency of the memory read and synchronization instructions, in cycles.
- GPU Instructions Executed per Instance: The average number of GPU instructions executed per one kernel instance.
- GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per one kernel instance.
If you enable the
Instruction count
profiling mode,
VTune
shows a breakdown of instructions executed by the kernel in the following groups:
Profiler

Control Flow group
| if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt and
mov, add instructions that explicitly change the ip register.
|
Send & Wait group
| send, sends, sendc, sendsc, wait |
Int16 & HP Float |
Int32 & SP Float | Int64 & DP Float groups
| Bit operations (only for integer types):
and, or, xor, and others.
Arithmetic operations:
mul, sub, and others;
avg, frc, mac, mach, mad, madm .
Vector arithmetic operations: line, dp2, dp4, and others.
Extended math operations.
|
Other group
| Contains all other operations including
nop .
|
In the
Instruction count
mode, the
VTune
also provides
Profiler
Operations per second
metrics calculated as a weighted sum of the following executed instructions:
- Bit operations (only for integer types):
- and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol- weight 1
- Arithmetic operations:
- add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub- weight 1
- avg, frc, mac, mach, mad, madm- weight 2
- Vector arithmetic operations:
- line- weight 2
- dp2, sad2- weight 3
- lrp, pln, sada2- weight 4
- dp3- weight 5
- dph- weight 6
- dp4- weight 7
- dp4a- weight 8
- Extended math operations:
- math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos(weight 4)
- math.fdiv, math.pow(weight 8)
The type of an operation is determined by the type of a destination operand.
View Data
VTune
runs the analysis and opens the data in the
Profiler
GPU Compute/Media Hotspots
viewpoint providing various platform data in the following windows:
- Summarywindow displays overall and per-engine GPU usage, percentage of time the EUs were stalled or idle with potential reasons for this, and the hottest GPU computing tasks.
- Graphicswindow displays CPU and GPU usage data per thread and provides an extended list of GPU hardware metrics that help analyze accesses to different types of GPU memory. For GPU metrics description, hover over the column name in the grid or right-click and select theWhat's This Column?context menu option.
Support for SYCL* Applications using oneAPI Level Zero API
This section describes support in the GPU Compute/Media Hotspots analysis for SYCL applications that run OpenCL or
oneAPI Level Zero API in the back end.
VTune
supports version 0.91.10 of the oneAPI Level Zero API.
Profiler
Support Aspect
| SYCL application with OpenCL as back end
| SYCL application with Level Zero as back end
|
---|---|---|
Operating System
| Linux OS
Windows OS
| Linux OS
Windows OS
|
Data collection
| VTune
collects and shows GPU computing tasks and the GPU computing queue.
Profiler | VTune
collects and shows GPU computing tasks and the GPU computing queue.
Profiler |
Data display
| VTune
maps the collected GPU HW metrics to specific kernels and displays them on a diagram.
Profiler | VTune
maps the collected GPU HW metrics to specific kernels and displays them on a diagram.
Profiler |
Display Host side API calls
| Yes
| Yes
|
Source Assembler for computing tasks
| Yes
| Yes
|
Instrumentation for GPU code ( Source Analysis option or
Dynamic Instruction Count characterization option)
| Yes
| Yes
|
For a use case on profiling a SYCL application running on an Intel GPU, see
Profiling a SYCL App Running on a GPU in the Intel® VTune Profiler Performance Analysis Cookbook .
Support for DirectX Applications
This section describes support available in the GPU analysis to trace Microsoft® DirectX* applications running on the CPU host. This support is available in the
Launch Application
mode only.
Support Aspect
| DirectX Application
|
---|---|
Operating system
| Windows OS
|
API version
| DXGI, Direct3D 11, Direct3D 12, Direct3D 11 on 12
|
Display host side API calls
| Yes
|
Direct Machine Learning (DirectML) API
| Yes
|
Device side computing tasks
| No
|
Source Assembler for computing tasks
| No
|