Examine Bottlenecks on GPU Roofline Chart
Explore Performance-Limiting Factors
- Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
- Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
- L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
- SLM cache roof: Shared local memory (SLM). Represents the maximal bandwidth of the SLM for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Graphics technology interface (GTI). Represents the maximum bandwidth between the GPU and the rest of the system on a chip (SoC). This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Dynamic random-access memory (DRAM). Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
- HBM roof: High bandwidth memory (HBM). Represents the maximum bandwidth of HBM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in discrete GPU caches.
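The relationship between horizontal and diagonal roofs can be sketched with the classic roofline formula: at a given arithmetic intensity, a memory roof allows at most intensity times bandwidth, capped by the compute roof. The roof values and function names below are hypothetical and chosen only for illustration. The real values are measured or calculated for your hardware.

```python
# Hypothetical roof values for illustration only (not measured on any device).
COMPUTE_ROOF_GFLOPS = 1000.0
MEMORY_ROOFS_GBS = {"L3": 400.0, "SLM": 300.0, "GTI": 50.0, "DRAM": 30.0}


def attainable_gflops(arithmetic_intensity, bandwidth_gbs,
                      compute_roof_gflops=COMPUTE_ROOF_GFLOPS):
    """Return the roofline ceiling at a given arithmetic intensity (FLOP/byte).

    The diagonal part is intensity * bandwidth; the horizontal part is the
    compute roof. The ceiling is the lower of the two.
    """
    diagonal = arithmetic_intensity * bandwidth_gbs
    return min(diagonal, compute_roof_gflops)


def ceilings(arithmetic_intensity):
    """Return the ceiling imposed by each hypothetical memory roof."""
    return {
        name: attainable_gflops(arithmetic_intensity, bandwidth)
        for name, bandwidth in MEMORY_ROOFS_GBS.items()
    }


if __name__ == "__main__":
    for name, gflops in ceilings(2.0).items():
        print(f"{name}: {gflops} GFLOPS")
```

At low arithmetic intensity the diagonal term is the smaller one, so the memory roof limits performance. At high intensity the compute roof takes over.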
Identify Hotspots and Estimate Room for Optimization
- Small green dots represent kernels with relatively small execution time (0-1% of the total time).
- Medium-sized yellow dots represent kernels with medium-range execution time (1-20% of the total time).
- Large red dots represent kernels with the largest execution time (20-100% of the total time).
- Their size clearly shows that improving self elapsed time for these kernels has a significant impact on the total time of the program.
- Their location shows that there is significant headroom for optimization.
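The dot categories and the idea of headroom can be expressed as a small sketch. The function names are hypothetical, the handling of the exact boundary values (1% and 20%) is an assumption because the ranges above overlap at their endpoints, and headroom is taken here as the simple ratio of a roof to the measured performance.

```python
def dot_category(time_share_percent):
    """Map a kernel's share of total time to the dot size and color.

    Assumption: a share of exactly 1% counts as medium and exactly 20%
    counts as large; the documented ranges do not say which side owns them.
    """
    if not 0 <= time_share_percent <= 100:
        raise ValueError("time share must be between 0 and 100")
    if time_share_percent < 1:
        return "small green"
    if time_share_percent < 20:
        return "medium yellow"
    return "large red"


def headroom(measured_gflops, roof_gflops):
    """Return how many times faster the kernel would be at the roof."""
    if measured_gflops <= 0:
        raise ValueError("measured performance must be positive")
    return roof_gflops / measured_gflops


if __name__ == "__main__":
    # Hypothetical kernel: 45% of total time, 50 GFLOPS under a 200 GFLOPS roof.
    print(dot_category(45), headroom(50.0, 200.0))
```

A kernel that is both a large red dot and far below its roof combines a large time share with a large headroom ratio, which makes it a natural first candidate.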
Explore Kernel Performance at Different Memory Levels
- Expand the filter pane in the GPU Roofline chart toolbar.
- In the Memory Level section, select the memory levels you want to see metrics for. By default, GPU Roofline reports data for GTI memory level (for integrated graphics) and HBM/DRAM memory level (for discrete graphics).
- CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
- L3: Data transferred directly between execution units and L3 cache.
- SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
- HBM: The maximum bandwidth of HBM memory available to your current graphics hardware. The HBM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in discrete GPU caches.
- GTI: GPU memory read bandwidth, that is, accesses between the GPU, chip uncore (LLC), and main memory on integrated GPUs. Use this to understand external memory traffic.
- DRAM: Maximum DRAM memory bandwidth available to your current GPU. The DRAM roof is measured using an optimized sequence of load operations iterating over an array that does not fit in GPU caches. This roof represents the maximum bandwidth between the GPU, chip uncore (LLC), and main memory on discrete GPUs.
- Spends less time on transferring data between L3 and CARM memory levels
- Uses as much data as possible for actual calculations
- Improves the elapsed time of the kernel and of the entire application
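Because each memory level sees a different amount of traffic, the same kernel has a different arithmetic intensity at each level. The sketch below assumes intensity is computed as floating-point operations divided by bytes transferred at that level. The function name and the traffic numbers are hypothetical.

```python
def intensity_by_level(flop_count, bytes_by_level):
    """Return arithmetic intensity (FLOP/byte) for each memory level.

    Levels with no recorded traffic are skipped, since intensity is
    undefined when no bytes were transferred.
    """
    return {
        level: flop_count / byte_count
        for level, byte_count in bytes_by_level.items()
        if byte_count > 0
    }


if __name__ == "__main__":
    # Hypothetical traffic for one kernel: fewer bytes reach L3 than CARM.
    traffic = {"CARM": 500, "L3": 250, "GTI": 0}
    print(intensity_by_level(1000, traffic))
```

The smaller the traffic at a level, the further to the right the dot for that level sits on the chart.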
Determine If Your Kernel Is Compute or Memory Bound
- Guidance on possible optimization steps depending on the factor limiting performance. Click the bounding factor to expand the hint.
- Amount of data transferred for each cache memory level.
- The exact roof that limits the kernel performance. The arrow points to what you should optimize the kernel for and shows the potential speedup after the optimization in the callout. If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound. Intel® Advisor displays a compute roof limiting the performance of your kernel based on the instruction mix used.
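The diagonal-versus-horizontal distinction can be illustrated with the ridge point of a simplified roofline with one compute roof and one memory roof. This is a sketch of the general model under that assumption, not the method the tool uses to select the limiting roof. The function name and values are hypothetical.

```python
def bound_kind(arithmetic_intensity, compute_roof_gflops, bandwidth_gbs):
    """Classify a kernel against one compute roof and one memory roof.

    The ridge point is the intensity at which the diagonal memory roof
    meets the horizontal compute roof. Kernels to its left sit under the
    diagonal line; kernels at or to its right sit under the horizontal line.
    """
    ridge_point = compute_roof_gflops / bandwidth_gbs
    if arithmetic_intensity < ridge_point:
        return "memory bound"
    return "compute bound"


if __name__ == "__main__":
    # Hypothetical roofs: 1000 GFLOPS compute, 50 GB/s memory (ridge at 20 FLOP/byte).
    print(bound_kind(2.0, 1000.0, 50.0))
    print(bound_kind(40.0, 1000.0, 50.0))
```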
Investigate Performance of Compute Tasks
- In the GPU Roofline chart:
- Click a dot on a Roofline chart and click the + button that appears next to the dot. The dot expands into several dots representing the corresponding compute tasks.
- Click a dot representing a compute task and view details about its global and local work size in the GPU Details pane.
- Hover over a dot representing a compute task to review and compare its performance metrics. Double-click the dot to highlight a roofline limiting the performance of a given instance.
- In the GPU pane grid:
- Expand a kernel in the Kernels column.
- View the information about the work size of a compute task by expanding the Work Size column in the grid. To view the number of compute tasks with a given global/local size, expand the Kernel Details column in the grid and examine the Instances metric.
- Compare performance metrics for different compute tasks using the grid and the GPU Details pane.