Tutorial

  • 2021.1
  • 12/04/2020
  • Public Content

Summary

This topic is part of a tutorial that shows how to use the automated Roofline chart to make prioritized optimization decisions.
The Roofline analysis is an optional analysis that plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance.
Use the Roofline chart to answer the following questions:
  • What is the maximum achievable performance with your current hardware resources?
  • Does your application work optimally on current hardware resources?
  • If not, what are the best candidates for optimization?
  • Is memory bandwidth or compute capacity limiting performance for each optimization candidate?
Roofline analysis is cache-aware; it measures all memory subsystem traffic, not just DDR memory traffic. It works on both single-threaded and multithreaded code.
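The relationship between arithmetic intensity, memory bandwidth, and peak compute that the chart plots can be sketched in a few lines of C++. This is a minimal illustration of the general roofline model, not Intel Advisor code: the function names are hypothetical, and any peak or bandwidth values you pass in are assumptions that you would replace with figures measured on your own machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Arithmetic intensity: floating-point operations per byte of memory traffic.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Ceiling at a given arithmetic intensity (FLOP/byte): the lower of the
// horizontal compute roof and the diagonal memory roof (bandwidth * intensity).
double attainable_gflops(double ai, double peak_gflops, double bandwidth_gbs) {
    assert(ai >= 0.0 && peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return std::min(peak_gflops, ai * bandwidth_gbs);
}

// Ridge point: the arithmetic intensity at which the memory roof meets the
// compute roof. Loops to the left of it are limited by the memory roof.
double ridge_point(double peak_gflops, double bandwidth_gbs) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

For example, a loop computing y[i] = a * x[i] + y[i] on doubles performs 2 floating-point operations and accesses 24 bytes per iteration (two loads and one store), so its arithmetic intensity is about 0.083 FLOP/byte. With an assumed 100 GFLOPS compute roof and an assumed 40 GB/s memory roof, attainable_gflops returns roughly 3.3 GFLOPS, well to the left of the ridge point at 2.5 FLOP/byte.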
This tutorial showed how to use the Vectorization Advisor and the roofline_demo_samples C++ sample application to:
  • Run a Roofline analysis.
  • Focus on the Roofline chart data of most interest.
  • Interpret Roofline chart data.
  • Use Roofline chart data interpretations to make optimization decisions.
Step
Key Tutorial Take-aways
1. Prepare for tutorial.
If you worked in the standalone Intel Advisor GUI: You built the target application in release mode with the Intel compiler, and created and configured a new Intel Advisor project to hold analysis results for the target.
If you worked in the Visual Studio* IDE: You opened the target solution and built the solution in release mode with the Intel compiler.
  • A target is an executable file the Intel Advisor can analyze.
  • To build applications that produce the most accurate and complete Vectorization Advisor analysis results, build an optimized binary of your application in release mode using the following settings:
    • /ZI
    • /DEBUG
    • /Qopt-report:5
    • /O2 or higher
    • /Qvec
    • /Qsimd
    • /Qopenmp
You performed a Roofline analysis, and got to know Roofline chart data and controls.
  • The Roofline analysis is a combination of the Survey analysis followed immediately by the Trip Counts/FLOPs analysis. The Trip Counts/FLOPs analysis may run three to four times longer than the Survey analysis.
  • The size and color of each Roofline chart dot represent relative execution time for each loop/function. Large red dots take the most time; small green dots take less time.
  • Horizontal Roofline chart lines (rooflines) indicate compute capacity limitations preventing loops/functions from achieving better performance without some form of optimization.
  • Diagonal Roofline chart lines indicate memory bandwidth limitations preventing loops/functions from achieving better performance without some form of optimization.
  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine; however, not all loops can utilize maximum machine capabilities.
  • The best candidates for the greatest performance improvement are large, red dots that are farther from the topmost achievable roofline.
  • The Roofline chart offers a variety of controls to configure appearance and focus on data of interest.
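The "large dots far below the roof" rule of thumb above can be expressed as a simple ranking. The sketch below is an illustrative heuristic only, not the way Intel Advisor ranks loops: the LoopDot struct, the function names, and the single compute roof and single memory roof are all assumptions made to keep the example small.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One dot on the chart (hypothetical structure for illustration).
struct LoopDot {
    std::string name;
    double self_time_s;  // execution time; drives dot size and color
    double ai;           // arithmetic intensity, FLOP/byte
    double gflops;       // achieved performance
};

// How many times faster the loop would run if it reached its ceiling.
double headroom(const LoopDot& d, double peak_gflops, double bandwidth_gbs) {
    assert(d.gflops > 0.0);
    const double ceiling = std::min(peak_gflops, d.ai * bandwidth_gbs);
    return ceiling / d.gflops;
}

// Rough upper bound on the time saved if the loop reached its ceiling.
double potential_saving_s(const LoopDot& d, double peak_gflops, double bandwidth_gbs) {
    const double h = headroom(d, peak_gflops, bandwidth_gbs);
    return h > 1.0 ? d.self_time_s * (1.0 - 1.0 / h) : 0.0;
}

// Sort so that the loops with the largest potential saving come first.
std::vector<LoopDot> rank_candidates(std::vector<LoopDot> dots,
                                     double peak_gflops, double bandwidth_gbs) {
    std::sort(dots.begin(), dots.end(),
              [&](const LoopDot& a, const LoopDot& b) {
                  return potential_saving_s(a, peak_gflops, bandwidth_gbs) >
                         potential_saving_s(b, peak_gflops, bandwidth_gbs);
              });
    return dots;
}
```

Weighting the headroom by execution time reflects both parts of the rule: a small dot far from the roof saves little overall, and a large dot already near the roof has little room left to improve.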
You opened a result snapshot, focused the Roofline chart on the data of most interest, and interpreted the data.
  • Memory bandwidth bottlenecks are generally overcome with cache optimizations.
  • Check data in other Intel Advisor views to support your Roofline chart interpretation.
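One common cache optimization is loop blocking (tiling), which restructures a loop nest so that each chunk of data is reused while it is still in cache. The sketch below applies it to a matrix transpose. It is a generic illustration rather than code from the tutorial sample, and the default block size of 32 is an assumption that would need tuning for a particular cache hierarchy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Naive transpose of an n x n row-major matrix. The writes to `out` are
// strided by n elements, so for large n most of them land on a different
// cache line than the previous write.
void transpose_naive(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n) {
    assert(in.size() == n * n && out.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[j * n + i] = in[i * n + j];
}

// Blocked transpose: processes one block x block tile at a time so the
// cache lines touched in both matrices are reused before being evicted.
void transpose_blocked(const std::vector<double>& in, std::vector<double>& out,
                       std::size_t n, std::size_t block = 32) {
    assert(block > 0 && in.size() == n * n && out.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += block) {
        const std::size_t i_end = std::min(ii + block, n);
        for (std::size_t jj = 0; jj < n; jj += block) {
            const std::size_t j_end = std::min(jj + block, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```

Both functions produce the same result; only the order of memory accesses differs. Whether the blocked version is actually faster depends on the matrix size and the machine, which is the kind of question a second Roofline run can help answer.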
You opened a result snapshot, focused the Roofline chart on the data of most interest, and interpreted the data.
  • Arithmetic intensity (the x-axis of the Roofline chart) = floating-point operations per byte accessed. Any given algorithm has an arithmetic intensity. In theory, optimization should not change this metric because it is a trait of the algorithm itself. So dots on a Roofline chart move up and down as performance changes, but rarely side to side.
  • Optimizing a loop is not enough to make the corresponding dot rise to the next roofline; a loop must make good use of the optimization. Inefficient vectorization is not good enough; an isolated fused multiply-add instruction (FMA) is not good enough.
  • In the right circumstances, you can use data layout and memory access optimizations to overcome both compute capacity and memory bandwidth limitations.
  • Take advantage of code-specific how-can-I-fix-this-issue? advice in the Recommendations tab.
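A typical data layout optimization is converting an array of structures (AoS) into a structure of arrays (SoA). The sketch below is a generic illustration with hypothetical type and function names, not code from the tutorial sample. It sums one field under both layouts to show how the memory access pattern changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: each particle's fields are adjacent in memory.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure of arrays: each field is stored contiguously across particles.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// AoS: reading only `mass` strides through memory 16 bytes at a time and
// uses 4 of every 16 bytes fetched.
float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) sum += p.mass;
    return sum;
}

// SoA: the same reduction reads one contiguous, unit-stride array.
float total_mass_soa(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float m : particles.mass) sum += m;
    return sum;
}

// Convert an AoS container to the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.mass.push_back(p.mass);
    }
    return soa;
}
```

The SoA version touches fewer bytes for the same work, which eases memory bandwidth pressure, and unit-stride access is generally easier for a compiler to vectorize efficiently. That is one way a single layout change can help with both kinds of roofline limitation.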
You opened a result snapshot, focused the Roofline chart on the data of most interest, and interpreted the data.
  • The first roofline above a dot position isn't always the bottleneck; any roofline above a dot position could be the culprit.
  • Even a roofline below a dot position can be a bottleneck; however, the farther a dot is positioned above a roofline, the less likely that roofline is causing the bottleneck.
  • If the first roofline above a dot position does not make logical sense, investigate the next roofline, and just keep working your way up the Roofline chart, using common sense, other Intel Advisor features, and your familiarity with your application to inform your investigation.
  • The Roofline chart is not a Data-In-Answers-Out utility; however, it puts you in the ballpark and guides you in the right direction to optimize your code.
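The "work your way up the chart" procedure can be sketched as a small helper that lists the rooflines above a dot, nearest first. This is an illustration only: the Roof and RoofAtDot structs and the function names are hypothetical, and the roof names and values in any call are assumptions rather than output from Intel Advisor.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A roofline: diagonal (memory, in GB/s) or horizontal (compute, in GFLOPS).
struct Roof {
    std::string name;
    bool is_memory;
    double value;
};

// A roofline evaluated at one dot's arithmetic intensity.
struct RoofAtDot {
    std::string name;
    double ceiling_gflops;
};

// Height of a roofline at a given arithmetic intensity (FLOP/byte).
double ceiling_at(const Roof& roof, double ai) {
    return roof.is_memory ? roof.value * ai : roof.value;
}

// Rooflines at or above the dot, ordered from nearest to farthest, which is
// the order in which to investigate them.
std::vector<RoofAtDot> roofs_above(const std::vector<Roof>& roofs,
                                   double ai, double gflops) {
    assert(ai > 0.0 && gflops > 0.0);
    std::vector<RoofAtDot> above;
    for (const Roof& roof : roofs) {
        const double ceiling = ceiling_at(roof, ai);
        if (ceiling >= gflops) above.push_back({roof.name, ceiling});
    }
    std::sort(above.begin(), above.end(),
              [](const RoofAtDot& a, const RoofAtDot& b) {
                  return a.ceiling_gflops < b.ceiling_gflops;
              });
    return above;
}
```

This helper deliberately skips rooflines below the dot. As noted above, a roofline just beneath a dot can still be a bottleneck, so treat the returned list as a starting order for investigation, not a verdict.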
