Identify Performance Bottlenecks Using CPU Roofline
- Measures the hardware limitations of your machine and collects loop/function timings using the Survey analysis.
- Collects floating-point and integer operations data, and memory data using the Characterization analysis.
- Arithmetic intensity (x axis) - the number of floating-point operations (FLOPs) and/or integer operations (INTOPs) performed per byte transferred between the CPU/VPU and memory, as determined by the loop/function algorithm.
- Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS).
- Dots of different color and size represent functions/loops. The size and color of a dot represent the execution time of that loop/function relative to the total execution time of the application. Large red dots are the most profitable to optimize because they take the most execution time. Small green dots take less time and may be poor candidates for optimization.
- Diagonal lines indicate memory bandwidth limitations preventing loops/functions from achieving better performance without optimization. For example, the L1 Bandwidth roofline represents the maximum amount of work that can get done at a given arithmetic intensity if the loop always hits L1 cache. A loop does not benefit from L1 cache speed if a dataset causes it to miss L1 cache too often. In this case, it is subject to the limitations of the lower-speed L2 cache it is hitting. So, a dot representing a loop that misses L1 cache too often but hits L2 cache is positioned below the L2 Bandwidth roofline.
- Horizontal lines indicate compute capacity limitations preventing loops/functions from achieving better performance without optimization. For example, the Scalar Add Peak represents the peak number of add instructions that can be performed by a scalar loop under these circumstances. The Vector Add Peak represents the peak number of add instructions that can be performed under these circumstances by a vectorized loop with the highest instruction set available. So, a dot representing a loop that is not vectorized is positioned somewhere below the Scalar Add Peak roofline.
- A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine; however, not all loops can utilize maximum machine capabilities.
- The greater the distance between a dot and the highest achievable roofline, the more room for optimization a function/loop has.
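The relationship between the axes, the rooflines, and the room for optimization can be sketched numerically. The Python example below is a minimal illustration of the general roofline model, not the tool's internal implementation. The `Loop` record, the helper names, and all bandwidth and peak values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    """Hypothetical per-loop measurements (not an Intel Advisor data format)."""
    name: str
    flops: float        # floating-point operations executed
    bytes_moved: float  # bytes transferred between CPU and memory
    seconds: float      # execution time of the loop


def arithmetic_intensity(loop: Loop) -> float:
    """x axis: operations per byte transferred."""
    return loop.flops / loop.bytes_moved


def gflops(loop: Loop) -> float:
    """y axis: billions of floating-point operations per second."""
    return loop.flops / loop.seconds / 1e9


def attainable_gflops(ai: float, bandwidth_gbs: float, compute_peak_gflops: float) -> float:
    """Roofline ceiling: the lower of the diagonal (bandwidth * intensity)
    and the horizontal (compute peak) line at this arithmetic intensity."""
    return min(ai * bandwidth_gbs, compute_peak_gflops)


# Assumed machine limits, for illustration only.
L1_BANDWIDTH_GBS = 200.0
DRAM_BANDWIDTH_GBS = 20.0
VECTOR_ADD_PEAK_GFLOPS = 40.0

loop = Loop(name="example_loop", flops=2e9, bytes_moved=8e9, seconds=0.5)

ai = arithmetic_intensity(loop)   # 0.25 FLOP/byte
perf = gflops(loop)               # 4.0 GFLOPS
dram_roof = attainable_gflops(ai, DRAM_BANDWIDTH_GBS, VECTOR_ADD_PEAK_GFLOPS)
l1_roof = attainable_gflops(ai, L1_BANDWIDTH_GBS, VECTOR_ADD_PEAK_GFLOPS)

print(f"{loop.name}: AI={ai} FLOP/byte, performance={perf} GFLOPS")
print(f"DRAM roof={dram_roof} GFLOPS, L1 roof={l1_roof} GFLOPS")
print(f"Headroom to the L1 roof: {l1_roof / perf:.1f}x")
```

With these assumed numbers, the dot sits just under the DRAM roof (5.0 GFLOPS) and well below the L1 roof (40.0 GFLOPS), which suggests that better cache reuse, rather than more vectorization, would be the first thing to investigate.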
advisor --collect=roofline --project-dir=./advi --search-dir src:p=./advi -- myApplication
- Survey analysis that collects loops/functions execution time data.
- Characterization analysis that collects floating-point and integer operations, memory traffic and mask utilization metrics for AVX-512 platforms to measure arithmetic intensity and performance of your application, and compute capacity of your hardware.
advisor --report=roofline --report-output=./advi/advisor-roofline.html --project-dir=./advi
advisor --help report
advisor --snapshot --project-dir=./advi --pack --cache-sources --cache-binaries -- /tmp/my_proj_snapshot
- Consider working with the most time-consuming function/loop indicated on a Roofline chart.
- Use the Code Analytics tab to examine the main information for the selected function/loop. Refer to the Roofline pane to identify whether the function/loop is compute or memory bound.
- Use the Recommendations tab to view hints on possible optimization steps for the selected function/loop in the Roofline Guidance section.
- If your loop is compute bound:
- Check the Vectorized Loops/Efficiency values in the Survey Report.
- Consider running Dependencies analysis to discover why the compiler assumed a dependency and did not vectorize the selected function/loop.
- Consider running Memory Access Patterns (MAP) analysis to identify expensive memory instructions.
- If your loop is memory bound:
- Explore the common use cases described in the Intel Advisor Cookbook:
- Explore useful Roofline Resources for Intel Advisor Users.
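As a rough illustration of the triage steps above (pick the most time-consuming loop, then decide whether it is compute or memory bound), the following Python sketch applies the textbook ridge-point rule from the roofline model. It is a simplification under stated assumptions: the loop records, the function names, and the peak and bandwidth values are all hypothetical, and the tool's own Roofline Guidance may weigh more factors than this.

```python
def ridge_point(compute_peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which the bandwidth line meets the compute peak."""
    return compute_peak_gflops / bandwidth_gbs


def classify(ai: float, compute_peak_gflops: float, bandwidth_gbs: float) -> str:
    """Left of the ridge point a loop is limited by memory bandwidth;
    at or right of it, by compute capacity."""
    if ai < ridge_point(compute_peak_gflops, bandwidth_gbs):
        return "memory bound"
    return "compute bound"


# Assumed machine limits and loop measurements, for illustration only.
COMPUTE_PEAK_GFLOPS = 40.0
BANDWIDTH_GBS = 20.0

loops = [
    {"name": "stencil_loop", "self_time_s": 3.2, "ai": 0.25},
    {"name": "matmul_loop", "self_time_s": 1.1, "ai": 8.0},
]

# Start with the loop that takes the most execution time.
hottest = max(loops, key=lambda entry: entry["self_time_s"])
verdict = classify(hottest["ai"], COMPUTE_PEAK_GFLOPS, BANDWIDTH_GBS)

print(f"Start with {hottest['name']}: {verdict}")
```

Here the ridge point is 2.0 FLOP/byte, so the hypothetical `stencil_loop` (0.25 FLOP/byte) would be treated as memory bound and `matmul_loop` (8.0 FLOP/byte) as compute bound.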