Measure GPU Performance Using GPU Roofline
- Measures the hardware limitations and collects OpenCL™, OpenMP*, oneAPI Level Zero (Level Zero), and Data Parallel C++ (DPC++) kernel timings and memory data using the Survey analysis with GPU profiling.
- Collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.
- Arithmetic intensity (x axis) - measured in floating-point operations (FLOP) per byte for the FLOAT Roofline chart and in integer operations (INTOP) per byte for the INT Roofline chart. The operation count is determined by the kernel algorithm, and the byte count is the data transferred between the GPU and memory.
- Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) for the FLOAT Roofline chart and in billions of integer operations per second (GINTOPS) for the INT Roofline chart.
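As a rough illustration of the two axes, the following Python sketch computes arithmetic intensity and performance from an operation count, a byte count, and an execution time. The function names and the sample kernel numbers are hypothetical; they are not output of the tool.

```python
def arithmetic_intensity(ops: float, bytes_transferred: float) -> float:
    """Operations per byte: the x-axis value of a dot."""
    if bytes_transferred <= 0:
        raise ValueError("bytes_transferred must be positive")
    return ops / bytes_transferred


def performance_gops(ops: float, seconds: float) -> float:
    """Billions of operations per second: the y-axis value of a dot."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return ops / seconds / 1e9


# Hypothetical kernel: 4e9 floating-point operations,
# 2e9 bytes transferred, 0.5 s of execution time.
ai = arithmetic_intensity(4e9, 2e9)
perf = performance_gops(4e9, 0.5)
print(f"{ai} FLOP/byte, {perf} GFLOPS")
```

The same two functions apply to the INT Roofline chart if the operation count is an integer-operation count.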
- Dots represent kernels. The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
- Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization. Depending on your system configuration, the following rooflines might be available on the Roofline chart:
- L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
- SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated using an analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
- Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
- A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
- The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
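The relationship between a dot and a roofline can be sketched with the standard roofline model, in which attainable performance is the smaller of the compute peak and arithmetic intensity multiplied by memory bandwidth. The sketch below is a minimal illustration under that assumption; the function names and all hardware numbers are hypothetical, not values measured by the tool.

```python
def roof_ceiling(ai: float, bandwidth_gb_per_s: float, compute_peak_gops: float) -> float:
    """Attainable performance (GOPS) under one memory roof and one compute roof."""
    memory_bound = ai * bandwidth_gb_per_s  # height of the diagonal line at this x position
    return min(memory_bound, compute_peak_gops)


def headroom_factor(measured_gops: float, ceiling_gops: float) -> float:
    """How many times faster the kernel could run before reaching the roof."""
    if measured_gops <= 0:
        raise ValueError("measured_gops must be positive")
    return ceiling_gops / measured_gops


# Hypothetical hardware: 50 GB/s DRAM roof, 1000 GOPS compute roof.
# Hypothetical kernel: 2.0 FLOP/byte, 8.0 GFLOPS measured.
ceiling = roof_ceiling(2.0, 50.0, 1000.0)
factor = headroom_factor(8.0, ceiling)
print(f"ceiling={ceiling} GFLOPS, headroom={factor}x")
```

A kernel with a headroom factor close to 1 sits near its roof, so further gains would need a change in arithmetic intensity rather than tuning at the same intensity.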
- The dots on the chart correspond to OpenCL, OpenMP, Level Zero, and DPC++ kernels, while in the CPU version, they correspond to individual loops.
- Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
- The GPU Roofline chart enables you to view arithmetic intensity of one kernel at multiple memory levels. To do so, double-click a dot representing this kernel or select it and press ENTER. The dots that appear on the Roofline chart correspond to different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots from a chart, use the Memory Level drop-down filter.
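To see why one kernel yields several dots, consider that the operation count stays fixed while the traffic differs at each memory level, so each level gives its own arithmetic intensity. The following sketch assumes this interpretation; the level labels, the function name, and the traffic numbers are hypothetical and may not match the labels shown in the filter.

```python
def intensity_by_level(ops: float, traffic_bytes: dict) -> dict:
    """Arithmetic intensity (operations per byte) for each memory level with nonzero traffic."""
    return {
        level: ops / nbytes
        for level, nbytes in traffic_bytes.items()
        if nbytes > 0
    }


# Hypothetical traffic for one kernel that performs 4e9 operations.
traffic = {"CARM": 8e9, "L3": 4e9, "SLM": 1e9, "GTI": 2e9}
per_level = intensity_by_level(4e9, traffic)

# Emulate hiding some dots, as a memory-level filter would.
visible = {level: ai for level, ai in per_level.items() if level in {"L3", "GTI"}}
print(per_level)
print(visible)
```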
advisor --collect=roofline --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
- Collect performance metrics for loops/functions of your application using Survey analysis:
  advisor --collect=survey --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
- Collect floating-point operations data using Characterization analysis:
  advisor --collect=tripcounts --no-trip-counts --flop --profile-gpu --project-dir=./advi --search-dir src:p=./advi -- myApplication
  Where:
- --no-trip-counts disables collection of trip counts during Characterization analysis.
- --flop enables collection of data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms during Characterization analysis.
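If you script the two-step collection, one possible approach is to build both command lines from shared arguments so that the project directory and search directory stay consistent between steps. The helper below is a hypothetical sketch: the function name and its parameters are assumptions, and it uses only the options shown in the commands above. It builds and prints the command lines without running them.

```python
import shlex


def gpu_roofline_steps(project_dir, search_dir, app_argv):
    """Return the Survey and Characterization command lines as argv lists."""
    common = [
        "--profile-gpu",
        f"--project-dir={project_dir}",
        "--search-dir", f"src:p={search_dir}",
        "--",          # everything after this belongs to the target application
        *app_argv,
    ]
    survey = ["advisor", "--collect=survey", *common]
    characterization = [
        "advisor", "--collect=tripcounts", "--no-trip-counts", "--flop", *common,
    ]
    return [survey, characterization]


steps = gpu_roofline_steps("./advi", "./advi", ["myApplication"])
for argv in steps:
    print(shlex.join(argv))
```

To run the steps, each list could be passed to `subprocess.run(argv, check=True)` in order, so that a failed Survey analysis stops the script before Characterization starts.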
- Survey analysis that collects execution time data for loops/functions and measures L3, SLM, and GTI traffic.
- Characterization analysis that collects floating-point and integer operations data, taking mask utilization into account, and CARM memory traffic to measure arithmetic intensity and performance of your application.
advisor --report=roofline --profile-gpu --report-output=./advi/advisor-roofline.html --project-dir=./advi
advisor --help report
advisor --snapshot --project-dir=./advi --pack --cache-sources --cache-binaries -- /tmp/my_proj_snapshot