GPU Compute/Media Hotspots View
Use the Intel® VTune™ Profiler's GPU Compute/Media Hotspots viewpoint to analyze how your GPU-bound code utilizes GPU and CPU resources.
Depending on the profiling mode selected for the GPU Compute/Media Hotspots analysis, you can explore your GPU-side code performance from different perspectives:
- Characterize performance issues for the code offloaded to the GPU.
- Analyze memory accesses using GPU hardware events.
- Analyze source for the most expensive operations and explore instruction execution.
Analyze Memory Accesses
The Characterization mode, which is enabled by default in the GPU Compute/Media Hotspots configuration, is a recommended entry-level analysis for GPU-bound applications.
The Hottest GPU Computing Task section of the Summary window displays the most time-consuming GPU tasks. Click such a task to switch to the Graphics tab and explore the GPU hardware metrics (by default, the Overview set of metrics) collected for this hotspot:

Analyze GPU Instruction Execution
If you enabled the Dynamic Instruction Count preset as part of the Characterization analysis configuration, the Graphics tab shows a breakdown of instructions executed by the kernel in the following groups:

| Instruction group | Instructions |
| --- | --- |
| Control Flow | if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and mov/add instructions that explicitly change the ip register |
| Send | send, sends, sendc, sendsc |
| Synchronization | wait |
| Int16 & HP Float, Int32 & SP Float, Int64 & DP Float | Bit operations (integer types only): and, or, xor, and others. Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm. Vector arithmetic operations: line, dp2, dp4, and others. Extended math operations: math.sin, math.cos, math.sqrt, and others. |
| Other | All other operations, including nop. |
The type of an operation is determined by the type of its destination operand.
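For illustration, here is a minimal sketch (the kernel and its name are hypothetical, not from this document) of how the destination operand type decides which group an instruction is counted in: 32-bit integer and single-precision float destinations fall into the Int32 & SP Float group, and extended math calls such as sqrt are counted in its extended math subgroup.

__kernel void grouping_example(__global int* idata, __global float* fdata)
{
    int gid = get_global_id(0);

    // Destination is a 32-bit integer: counted in the Int32 & SP Float group.
    idata[gid] = idata[gid] + 7;

    // Destination is a single-precision float: also counted in Int32 & SP Float.
    fdata[gid] = fdata[gid] * 0.5f;

    // Extended math operation: counted in the extended math subgroup
    // (math.sqrt in the table above).
    fdata[gid] = sqrt(fdata[gid]);
}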
In the Graphics tab, VTune Profiler also provides the SIMD Utilization metric. This metric helps identify kernels that underutilize the GPU by producing instructions that cause thread divergence. A common cause of low SIMD utilization is conditional branching within the kernel, since the threads execute all of the execution paths sequentially, with each thread executing one path while the other threads are stalled.
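As a hedged illustration (the kernels and their names are hypothetical, not from this document), the following OpenCL kernel branches on the work-item ID, so neighboring work items in the same SIMD group take different paths and SIMD Utilization drops; the branchless variant keeps all lanes active for the same work.

__kernel void divergent_scale(__global float* data)
{
    int gid = get_global_id(0);

    // Even and odd work items take different paths, so each path executes
    // with part of the SIMD lanes masked off.
    if (gid % 2 == 0)
        data[gid] = data[gid] * 2.0f;
    else
        data[gid] = data[gid] * 0.5f;
}

__kernel void uniform_scale(__global float* data)
{
    int gid = get_global_id(0);

    // The ternary is usually compiled to a select rather than a divergent
    // branch, so all lanes stay active.
    float factor = (gid % 2 == 0) ? 2.0f : 0.5f;
    data[gid] = data[gid] * factor;
}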
To get additional information, double-click the hottest function to open the Source view. Enable both the Source and Assembly panes to get a side-by-side view of the source code and the resulting assembly code. You can then locate assembly instructions with low SIMD Utilization values and map them to specific lines of source code by clicking an instruction. This helps you identify and optimize the kernels that do not meet your SIMD Utilization criteria.

For information on the Instruction Set Architecture (ISA) of Intel® Iris® Xe MAX Graphics, see the Intel® Iris® Xe MAX Graphics Open Source Programmer's Reference Manual.
Analyze Source
If you selected the Source Analysis mode for the GPU Compute/Media Hotspots analysis, you can analyze a kernel of interest for basic block latency or memory latency issues. To do this, in the Graphics tab, expand the kernel node and double-click the function name. VTune Profiler redirects you to the hottest source line for the selected function:

The GPU Compute/Media Hotspots analysis provides a full-scale, per-code-line analysis of the kernel source. The hottest kernel code line is highlighted by default.
To view performance statistics on the GPU instructions executed per kernel instance, switch to the Assembly view:

If your OpenCL kernel uses inline functions, make sure to enable Inline Mode on the filter toolbar so that GPU cycles are correctly attributed per function.
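For example, here is a minimal sketch (the kernel and helper names are hypothetical) of a kernel that relies on an inline helper; without Inline Mode, the GPU cycles spent in the helper are shown only at its call site inside the kernel rather than attributed to the helper itself.

// Hypothetical inline helper whose cycles need Inline Mode for correct attribution.
inline float damped(float x, float k)
{
    return x * exp(-k * x);   // extended math executed inline in the caller
}

__kernel void apply_damping(__global float* data, float k)
{
    int gid = get_global_id(0);
    data[gid] = damped(data[gid], k);
}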
See the examples below.
Examine Energy Consumption by your GPU
In Linux environments, when you run the GPU Compute/Media Hotspots analysis on an Intel® Iris® Xe MAX graphics discrete GPU, you can see energy consumption information for the GPU device. To collect this information, make sure you check the Analyze power usage option when you configure the analysis.

Once the analysis completes, see energy consumption data in these sections of your results.
In the Graphics window, observe the Energy Consumption column in the grid when results are grouped by Computing Task. Sort this column to identify the GPU kernels that consumed the most energy. You can also see this information mapped in the timeline.
Tune for Power Usage
Once you locate the individual GPU kernels that consume the most energy, start by tuning the top energy hotspot for optimum power efficiency.
Tune for Processing Time
If your goal is to optimize GPU processing time, keep an eye on the energy consumption metrics per kernel to monitor the tradeoff between processing time and power use. Move the Energy Consumption column next to the Total Time column to make this comparison easier.

You may notice that the correlation between power use and processing time is not direct. The kernels that compute the fastest may not be the kernels that consume the least energy. Check whether larger power usage values correspond to longer stall/wait periods.
Energy consumption metrics do not display in GPU profiling analyses that scan Intel® Iris® Xe MAX graphics on Windows machines.
Example: Basic Block Latency Profiling
You have an OpenCL kernel that performs compute operations:
__kernel void viete_formula_comp(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = sum + p;
        c = (t - sum) - p;
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}
To compare these operations, run the GPU In-kernel Profiling in the Basic block latency mode and double-click the kernel in the grid to open the Source view:

The Source view analysis highlights the pown() call as the most expensive operation in this kernel.
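As a follow-up sketch (not part of the original example, and worth confirming by re-profiling), one way to act on this finding is to avoid calling pown() on every iteration and accumulate the power of 3 incrementally instead:

__kernel void viete_formula_comp_opt(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;
    float pow3 = 1.0f;                 // running value of 3^i
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pow3;                     // replaces the pown(3.f, i) call
        pow3 *= 3.0f;                  // update the power for the next iteration
        p -= c;
        t = sum + p;
        c = (t - sum) - p;
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}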
Example: Memory Latency Profiling
You have an OpenCL kernel that performs several memory reads of the data buffer (the data[gid] accesses inside the loop and in the final statement):
__kernel void viete_formula_mem(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = data[gid] + p;
        c = (t - data[gid]) - p;
        data[gid] = t;
    }
    data[gid] *= sqrt(12.f);
}
To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode:

The Source view analysis shows that the compiler understands that each thread works only with its own element of the input buffer and generates code that performs the read only once. The value from the input buffer is stored in a register and reused in other operations, so the compiler does not generate additional reads.
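For illustration, here is a hedged source-level sketch (not from the original) that makes this register caching explicit; it is roughly equivalent to what the text above describes the compiler generating for the kernel:

__kernel void viete_formula_mem_explicit(__global float* data)
{
    int gid = get_global_id(0);
    float value = data[gid];           // single read into a register
    float c = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = value + p;                 // reuse the cached value
        c = (t - value) - p;
        value = t;                     // keep the running result in the register
    }
    data[gid] = value * sqrt(12.f);    // single write at the end
}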