gpu-hotspots Command Line Analysis
- Explore GPU kernels with high GPU utilization, estimate the effectiveness of this utilization, identify possible reasons for stalls or low occupancy and options.
- Explore the performance of your application per selected GPU metrics over time.
- Analyze the hottest DPC++ or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item configuration.
Configure Characterization Analysis
- Monitor the Render and GPGPU engine usage (Intel Graphics only)
- Identify which parts of the engine are loaded
- Correlate GPU and CPU data
- Overview metric set includes additional metrics that track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.
- Compute Basic (with global/local memory accesses) metric group includes additional metrics that distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
- Compute Extended metric group includes additional metrics targeted only for GPU analysis on Intel processors code named Broadwell and later. On other systems, this preset is not available.
- Full Compute metric group is a combination of the Overview and Compute Basic event sets.
- Dynamic Instruction Count metric group counts the execution frequency of specific classes of instructions. With this metric group, you also get an insight into the efficiency of SIMD utilization by each kernel.
- Use the Trace GPU programming APIs option to analyze DPC++, OpenCL™, or Intel Media SDK programs running on Intel Processor Graphics. This option may affect the performance of your application on the CPU side.
For DPC++ or OpenCL applications, you may identify the hottest kernels and identify the GPU architecture block where a performance issue for a particular kernel was detected.
For Intel Media SDK programs, you may explore the Intel Media SDK tasks execution on the timeline and correlate this data with the GPU usage at each moment of time.
In the Attach to Process mode, if you attach to a process after the computing queue has already been created, VTune Profiler will not display data for the OpenCL kernels in this queue.
Support limitations:
- OpenCL kernels analysis is possible for Windows and Linux targets running on Intel Graphics.
- Intel Media SDK program analysis is possible for Windows and Linux targets running on Intel Graphics.
- Only Launch Application or Attach to Process target types are supported.
- Use the GPU sampling interval, ms field to specify an interval (in milliseconds) between GPU samples for GPU hardware metrics collection. By default, VTune Profiler uses a 1 ms interval.
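The options above can also be set from the command line. The following Python sketch only assembles a command line; it does not run the collection. The helper name build_gpu_hotspots_command is hypothetical. The knob name gpu-sampling-interval is an assumption for the GPU sampling interval, ms field, and enable-gpu-runtimes (taken from the example command on this page) is assumed to correspond to the Trace GPU programming APIs option. Check the knob names against the output of vtune -help collect gpu-hotspots for your VTune Profiler version before relying on them.

```python
import shlex


def build_gpu_hotspots_command(app_path, sampling_interval_ms=1.0, trace_gpu_apis=True):
    """Assemble (but do not run) a vtune command line for gpu-hotspots.

    Hypothetical helper for illustration only.
    """
    if sampling_interval_ms <= 0:
        raise ValueError("sampling interval must be positive")

    argv = ["vtune", "-collect", "gpu-hotspots"]

    # ASSUMPTION: knob name for the "GPU sampling interval, ms" field.
    argv += ["-knob", f"gpu-sampling-interval={sampling_interval_ms:g}"]

    # Knob used in this page's example command; assumed to map to the
    # "Trace GPU programming APIs" option.
    flag = "true" if trace_gpu_apis else "false"
    argv += ["-knob", f"enable-gpu-runtimes={flag}"]

    # Everything after "--" is the target application and its arguments.
    argv += ["--", app_path]
    return argv


if __name__ == "__main__":
    command = build_gpu_hotspots_command("/home/test/myApplication",
                                         sampling_interval_ms=0.5)
    print(shlex.join(command))
```

Building the argument list first and joining it with shlex.join keeps application paths that contain spaces correctly quoted when the command is pasted into a shell.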
Configure Source Analysis
- Basic Blocks Latency option helps you identify issues caused by algorithm inefficiencies. In this mode, VTune Profiler measures the execution time of all basic blocks. A basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end of this sequence. During post-processing, VTune Profiler calculates the execution time for each instruction in the basic block. This mode helps you understand which operations are more expensive.
- Memory Latency option helps identify latency issues caused by memory accesses. In this mode, VTune Profiler profiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option if you ran the GPU Compute/Media Hotspots analysis in the Characterization mode, identified that the GPU kernel is throughput-bound or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take more time.
- Estimated GPU Cycles: The average number of cycles spent by the GPU executing the profiled instructions.
- Average Latency: The average latency of the memory read and synchronization instructions, in cycles.
- GPU Instructions Executed per Instance: The average number of GPU instructions executed per one kernel instance.
- GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per one kernel instance.
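As an illustration of how the two per-instance averages above relate, here is a small Python sketch. It reflects one plausible reading of the definitions, not the exact computation VTune Profiler performs. The record layout (total instructions and thread count per kernel instance) and both function names are hypothetical.

```python
def instructions_per_instance(instances):
    """Average number of GPU instructions executed per kernel instance.

    `instances` is a list of (total_instructions, thread_count) tuples,
    one per kernel instance (hypothetical input format).
    """
    return sum(total for total, _ in instances) / len(instances)


def instructions_per_thread(instances):
    """Average number of instructions executed by one thread per instance.

    ASSUMPTION: each instance's total is divided evenly by its thread
    count, and the result is averaged over all instances.
    """
    return sum(total / threads for total, threads in instances) / len(instances)


if __name__ == "__main__":
    # Two instances of the same kernel, each running 4 threads.
    samples = [(1200, 4), (800, 4)]
    print(instructions_per_instance(samples))  # 1000.0
    print(instructions_per_thread(samples))    # 250.0
```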
Control flow group
if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and mov, add instructions that explicitly change the ip register.
Send & Wait group
send, sends, sendc, sendsc, wait
Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups
- Bit operations (only for integer types): and, or, xor, and others.
- Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
- Vector arithmetic operations: line, dp2, dp4, and others.
- Extended math operations.
Other group
Contains all other operations including
- Bit operations (only for integer types):
- and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol (weight 1)
- Arithmetic operations:
- add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub (weight 1)
- avg, frc, mac, mach, mad, madm (weight 2)
- Vector arithmetic operations:
- line (weight 2)
- dp2, sad2 (weight 3)
- lrp, pln, sada2 (weight 4)
- dp3 (weight 5)
- dph (weight 6)
- dp4 (weight 7)
- dp4a (weight 8)
- Extended math operations:
- math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos(weight 4)
- math.fdiv, math.pow(weight 8)
vtune -collect gpu-hotspots -knob enable-gpu-runtimes=true -- /home/test/myApplication