Window: Graphics - GPU Compute/Media Hotspots
Grid. Analyze basic performance metrics per program unit and identify the most time-consuming units. If your application uses the OpenCL software technology and you ran the analysis with the Trace GPU Programming APIs option enabled, the grid is grouped by the Computing Task Purpose granularity by default.
Analyze and optimize the hot kernels with the longest Total Time values first. These include kernels with long Average Time values and kernels whose Average Time values are modest but that are invoked more often than the others (see the Instance Count values). Both groups deserve attention. For more details, see GPU OpenCL™ Application Analysis.
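The prioritization above can be sketched in a few lines. This is a minimal illustration, not a VTune API: the `KernelStats` type, the kernel names, and the timings are all hypothetical, and it assumes that Total Time is approximately Average Time multiplied by Instance Count.

```python
from dataclasses import dataclass


@dataclass
class KernelStats:
    """Hypothetical record copied by hand from the grid (not a VTune type)."""
    name: str
    average_time_ms: float  # Average Time per invocation
    instance_count: int     # Instance Count

    @property
    def total_time_ms(self) -> float:
        # Assumption: Total Time ~= Average Time * Instance Count.
        return self.average_time_ms * self.instance_count


def rank_by_total_time(kernels):
    """Return kernels sorted by Total Time, longest first."""
    return sorted(kernels, key=lambda k: k.total_time_ms, reverse=True)


# Made-up numbers for illustration only.
kernels = [
    KernelStats("blur", average_time_ms=12.0, instance_count=10),
    KernelStats("reduce", average_time_ms=0.5, instance_count=4000),
    KernelStats("init", average_time_ms=30.0, instance_count=1),
]

ranked = rank_by_total_time(kernels)
for k in ranked:
    print(f"{k.name:8s} total={k.total_time_ms:8.1f} ms "
          f"avg={k.average_time_ms:6.2f} ms count={k.instance_count}")
```

In this made-up data set, `reduce` has the shortest Average Time but ranks first because it is invoked far more often, which is the second group of kernels described above.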
To understand the CPU activity (which module or function was executed and its CPU time) while the GPU execution units were idle, queued, or busy executing some code, use the GPU Render and EU Engine State grouping level.
Thread. Explore CPU and GPU utilization by a particular thread. The Platform tab displays the thread name as the name of the module where the thread function resides. For example, if you have a myFoo function that belongs to the MyMegaFoo.dll (Windows*) module, the thread name is displayed as MyMegaFoo.dll (Windows*). This approach makes it easy to identify the location of the thread code producing the work displayed on the timeline.
Windows* targets only: Correlate CPU and GPU usage and estimate whether your application is GPU bound. GPU Engines Usage bars show DMA packets on the CPU threads that originate GPU tasks. The bars are colored according to the type of GPU engine used (yellow bars in the example above correspond to the Render and GPGPU engine).
GPU hardware metrics. If you enabled the Analyze Processor Graphics hardware events option for GPU analysis on processors with Intel® HD Graphics and Intel® Iris® Graphics, VTune displays the statistics for the selected group of metrics over time.
For example, for the default Overview group of metrics, you may start with the GPU Execution Units: EU Array Idle metric. Idle cycles are wasted cycles: no threads are scheduled and the EUs' computational resources are not being utilized. If EU Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them.
In most cases, the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU Array Active metric. The exception is memory bandwidth-bound algorithms and workloads, where optimization should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than maximize EU Array Active).
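The order of checks described above (idle first, then the bandwidth-bound exception, then stalls) can be written down as a small triage helper. This is only a sketch: the function name, the return strings, and both threshold values are hypothetical and are not taken from VTune. The `bandwidth_bound` flag is something you determine yourself from knowledge of the algorithm.

```python
def triage_eu_array(active_pct, stalled_pct, idle_pct,
                    bandwidth_bound=False,
                    idle_threshold=10.0, stalled_threshold=20.0):
    """Suggest where to look first, given EU Array percentages.

    The thresholds are illustrative assumptions, not VTune defaults.
    """
    if idle_pct > idle_threshold:
        # Idle cycles are wasted: no threads are scheduled on the EUs.
        return "idle: EUs have no threads scheduled; check how work is submitted"
    if bandwidth_bound:
        # Exception case: aim for peak bandwidth rather than EU Array Active.
        return "bandwidth: compare achieved memory bandwidth with the platform peak"
    if stalled_pct > stalled_threshold:
        return "stalled: reduce stalls to raise EU Array Active"
    return "active: EUs are mostly executing instructions"


# Made-up samples for illustration only.
print(triage_eu_array(active_pct=30.0, stalled_pct=20.0, idle_pct=50.0))
print(triage_eu_array(active_pct=45.0, stalled_pct=50.0, idle_pct=5.0))
print(triage_eu_array(active_pct=45.0, stalled_pct=50.0, idle_pct=5.0,
                      bandwidth_bound=True))
```

Treat the result as a hint about which metric to inspect next, not as a verdict: a single averaged percentage can hide phases of the run that behave very differently.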
Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory accesses cannot be overestimated. If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses metric, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.
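One way to make "correlates with" concrete is to compute a Pearson correlation coefficient over time-aligned samples of the two metrics. The sketch below assumes you have copied the per-interval values out of the timeline by hand; the sample values are made up, and what counts as a strong correlation is your own judgment call, not something VTune defines.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same time intervals.
eu_array_stalled_pct = [10.0, 35.0, 60.0, 20.0, 55.0]
gpu_l3_misses = [1000, 4200, 7100, 2500, 6400]

r = pearson(eu_array_stalled_pct, gpu_l3_misses)
print(f"correlation between EU Array Stalled and GPU L3 Misses: {r:.2f}")
```

A coefficient close to 1 suggests that stalls rise and fall with L3 misses, which points toward memory accesses and layout. A weak correlation suggests looking for another cause of the stalls.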
Sampler accesses are expensive and can easily cause stalls. Sampler accesses are measured by the Sampler Is Bottleneck and
Computing Queue. To analyze details of OpenCL kernel submission, in particular to distinguish the order of submission from the order of execution and to identify the time spent in the queue, zoom in and explore the Computing Queue data.
VTune displays kernels with the same name and global/local size in the same color. Synchronization tasks are marked with vertical hatching. Data transfers are marked with cross-diagonal hatching.
You can click a kernel task to highlight the whole queue to the execution displayed at the top layer. Hover over an object in the queue to see kernel execution parameters.
Windows targets only: Switch to the Platform window to explore how the execution path of the OpenCL device queue correlates to the DMA packets software queue.
GPU Usage metrics. GPU usage bars are colored according to the type of GPU engine used.
Theoretically, if the Platform tab shows that the GPU is busy most of the time with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and you need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engine usage is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the ineffective scheduling on the GPU results from the application code running on the CPU.
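The rule of thumb above can be expressed as a rough first-pass check. The sketch below is not part of VTune: the function names, the input format (busy intervals and queue-depth samples transcribed from the timeline), and all three thresholds are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) busy intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def busy_fraction(intervals, duration):
    """Fraction of the run during which the GPU was busy."""
    return sum(end - start for start, end in merge_intervals(intervals)) / duration


def classify(gpu_busy, queue_depths, duration, cpu_busy_in_gaps,
             busy_threshold=0.9, idle_threshold=0.5, empty_queue_threshold=0.05):
    """First-pass guess only; thresholds are illustrative assumptions."""
    busy = busy_fraction(gpu_busy, duration)
    empty = sum(1 for d in queue_depths if d == 0) / len(queue_depths)
    if busy >= busy_threshold and empty <= empty_queue_threshold:
        return "likely GPU bound"
    if busy < idle_threshold and cpu_busy_in_gaps:
        return "likely CPU bound"
    return "needs detailed analysis"


# Made-up timeline data (times in ms) for illustration only.
verdict_a = classify(gpu_busy=[(0, 40), (42, 95)],
                     queue_depths=[2, 3, 1, 2, 4, 1, 2, 3, 2, 1],
                     duration=100, cpu_busy_in_gaps=False)
verdict_b = classify(gpu_busy=[(0, 10), (50, 60)],
                     queue_depths=[0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
                     duration=100, cpu_busy_in_gaps=True)
print(verdict_a)
print(verdict_b)
```

Note that merging intervals across all GPU engines reproduces the pitfall described above: if the video processing and rendering engines are loaded in turns, the merged timeline looks busy even though each engine is idle much of the time. Running the check per engine is one way to expose that serialization.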