Visualize Performance Improvements with Roofline Compare
- Performance analysis tools: Intel Advisor 2019 Update 4. The latest version is available for download at: https://software.intel.com/content/www/us/en/develop/tools/advisor/choose-download.html
- Application: The Vector Sample code, available as part of the samples package in the Intel Advisor installation folder. The code and optimization process followed in this recipe are part of the Add Efficient SIMD Parallelism to Code Using the Vectorization Advisor tutorial. Intel Advisor 2019 tutorials are available at: https://software.intel.com/content/www/us/en/develop/articles/advisor-tutorials.html
- Compiler: Microsoft® C/C++ Optimizing Compiler Version 19.14.26431 for x64
- Operating system: Microsoft® Windows 10, Version 1709
- CPU: Intel® Core™ i5-7300HQ processor
Collect Baseline Roofline Results
- The Elapsed time value in the top left corner. This is the baseline against which subsequent improvements will be measured.
- In the Type column, all detected loops are scalar.
- In the Why No Vectorization? column, the compiler detected or assumed vector dependence in most of the loops.
Optimize with the NOALIAS Macro
- Click the Why No Vectorization? tab, then click one of the loops for which the compiler previously detected or assumed vector dependence.
- Scroll down to the Recommendations section to view suggestions for vectorizing the loop. In the example below, one of the suggestions is to use the restrict keyword. restrict tells the compiler that two pointers do not point to overlapping memory regions. If the compiler knows that a memory block is accessed through only one pointer, it can produce better vectorized code. In the first optimization, we will try to limit the effect of pointer aliasing by providing this information to the compiler using the NOALIAS macro.
- In the Visual Studio* IDE, right-click the vec_samples project in the Solution Explorer, then choose Properties.
- Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOALIAS.
- Click Apply, then click OK.
- Choose Build > Rebuild Solution.
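To make the effect of /DNOALIAS more concrete, here is a minimal sketch of how a macro-gated restrict qualifier might look. It is not the actual Vector Sample source: the helper macro NO_OVERLAP and the function matvec_sketch are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical illustration, not the sample's real code.
   When NOALIAS is defined (for example with /DNOALIAS), the pointer
   parameters below are declared as non-overlapping. */
#ifdef NOALIAS
#  if defined(_MSC_VER)
#    define NO_OVERLAP __restrict   /* Microsoft compiler spelling */
#  else
#    define NO_OVERLAP restrict     /* C99 keyword */
#  endif
#else
#  define NO_OVERLAP                /* expands to nothing: aliasing is possible */
#endif

/* Accumulates the matrix-vector product into b:
   b[i] += sum over j of a[i * cols + j] * x[j]. */
void matvec_sketch(size_t rows, size_t cols,
                   const double *NO_OVERLAP a,
                   double *NO_OVERLAP b,
                   const double *NO_OVERLAP x)
{
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            b[i] += a[i * cols + j] * x[j];
        }
    }
}
```

Without the qualifier, the compiler has to allow for the possibility that a store to b[i] changes an element of a or x, which is one reason it may assume a vector dependence. The qualifier is a promise made by the programmer, so it is only safe when the arrays really do not overlap.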
Re-run the Roofline Analysis
- In the Vectorization Workflow pane, click the Collect button below Run Roofline and save a snapshot of the result as Snapshot_NoAlias (preferably in a new directory, though this is not strictly required).
- Load the Favourable View json file by clicking the menu icon in the top right corner. Once the file is loaded, the roofs are adjusted to match Snapshot_Baseline.
- Notice the improvements in the total performance of the program and of the loop in matvec at Multiply.c:60, as shown in the image below.
- In the Survey report, notice that:
- The value in the Vector Instruction Set column is probably AVX2/AVX/SSE2, i.e., the default vector Instruction Set Architecture (ISA).
- The compiler successfully vectorizes two loops: in matvec at Multiply.c:69 and in matvec at Multiply.c:60.
- Elapsed time improves substantially.
- Open the Snapshot_Baseline snapshot.
- In Snapshot_Baseline, go to the Roofline plot and click the Compare drop-down list, followed by the + Load result for comparison icon. Intel Advisor shows any snapshots in the same directory as Snapshot_Baseline in the Ready for comparison list. These snapshots can be used for Roofline comparisons. Select Snapshot_NoAlias using the Load result for comparison option. You can remove a comparison result using the × Clear comparison result(s) icon.
- Notice that the loop in matvec at Multiply.c:60 has changed color in the Roofline plot, as it was scalar in Snapshot_Baseline and is vectorized in Snapshot_NoAlias.
- The Roofline Compare feature automatically recognizes similar loops from both snapshots. It connects related loops with a dashed line and displays the performance improvement between the loops, i.e., the difference in FLOPS (or INTOPS or OPS) and Total Time.
- To find the same loops among the results, Intel Advisor compares several loop features, such as loop type, nesting level, source code file name and line, and name of the function. When a certain threshold of similar or equal features is reached, the two loops are considered a match and connected with a dashed line.
- However, this method still has a few limitations. Sometimes no match is found for the same loop if it has been optimized, parallelized, or moved four or more lines away from its original position in the source code.
- Intel Advisor tries to strike a balance between tolerating source code changes and avoiding false positives.
- 71.92% = 6.02 units / 8.37 units * 100%
- 8.37 GFLOPS: performance value for the compared loop
- 2.35 GFLOPS: performance value for the current loop
- -71.91% = -2.028 s / 2.820 s * 100%
- 0.792 s: Total Time value for the compared loop
- 2.820 s: Total Time value for the current loop
- In the side-by-side view of the Survey report and Roofline comparison plot above, clicking on each loop in the Survey report highlights the corresponding loop in the Roofline plot and also highlights the dashed line connecting similar loops. Note that in the image above, we have removed the Filter In Selection feature to visualize this better.
- From the Roofline snapshot and Survey report for Snapshot_NoAlias, we can see that there is still room for improvement for the loops in Snapshot_NoAlias.
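The percentage figures quoted for the loop comparison (71.92% for performance and -71.91% for Total Time) can be reproduced with a few lines of arithmetic. The sketch below only re-computes the numbers given in this recipe. The function names are hypothetical, and the choice of denominator simply follows the two quoted formulas; in both of them it happens to be the larger of the two values, but the recipe does not state a general rule.

```c
/* Percentage that `delta` represents of `base`. */
double percent_of(double delta, double base)
{
    return delta / base * 100.0;
}

/* Performance difference quoted in the recipe:
   compared loop 8.37 GFLOPS, current loop 2.35 GFLOPS,
   difference 6.02, divided by the compared value 8.37. */
double flops_change_percent(void)
{
    const double compared = 8.37;
    const double current = 2.35;
    return percent_of(compared - current, compared);
}

/* Total Time difference quoted in the recipe:
   compared loop 0.792 s, current loop 2.820 s,
   difference -2.028 s, divided by the current value 2.820 s. */
double time_change_percent(void)
{
    const double compared = 0.792;
    const double current = 2.820;
    return percent_of(compared - current, current);
}
```

Both results agree with the quoted values to within rounding: roughly 71.92 for the performance difference and roughly -71.91 for the Total Time difference.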
Continue to Optimize: Dependencies and More
- In the drop column in the Survey report, select the checkbox for the loop in matvec at Multiply.c:82.
- In the Vectorization Workflow pane, click the Collect button under Check Dependencies to produce a dependencies report.
- The Dependencies analysis usually takes a while to generate the report. If analysis time is a consideration during this exercise, click the Stop button under Check Dependencies once the site coverage progress bar shows 1/1 sites executed. This displays the results collected so far. Note, however, that outside of this recipe, stopping early risks missing some dependencies (for example, when the selected loops are called several times).
- Rebuild the solution with the /DREDUCTION option. Re-run the Roofline analysis and save the result as Snapshot_xHost_Reduction.
- Observe that the loop in matvec at Multiply.c:82 is now vectorized. The Elapsed time is also improved.
- Open the Snapshot_Baseline result and, using the Roofline Compare feature, add Snapshot_NoAlias and Snapshot_xHost_Reduction for comparison.
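As a rough illustration of what a macro such as REDUCTION might switch on, the sketch below gates an OpenMP SIMD reduction hint behind the macro. This is an assumption made for illustration: the recipe does not show the sample's source, the function name dot_row_sketch is hypothetical, and the pragma only has an effect when the compiler is set up to honor OpenMP SIMD directives.

```c
#include <stddef.h>

/* Hypothetical illustration, not the sample's real code.
   Computes the dot product of one matrix row with the vector x. */
double dot_row_sketch(const double *row, const double *x, size_t n)
{
    double sum = 0.0;

    /* Every iteration reads and writes `sum`, which is a loop-carried
       dependence. Declaring it as a reduction tells the compiler that
       partial sums may be accumulated independently and combined at the
       end. */
#ifdef REDUCTION
#pragma omp simd reduction(+:sum)
#endif
    for (size_t j = 0; j < n; j++) {
        sum += row[j] * x[j];
    }
    return sum;
}
```

Because a vectorized reduction changes the order of the floating-point additions, the result can differ in the last bits from the scalar version. That is one reason compilers tend not to apply this transformation without an explicit hint.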
- The Roofline plot in Intel Advisor can be used to visually represent application performance in relation to hardware limitations: memory bandwidth and computational peaks.
- Intel Advisor 2019 has a new feature called Roofline Compare, which can be used to see the shift of loops and functions after each optimization effort. With this feature, the process of optimization becomes less challenging, as it helps developers to quantify and visualize their optimization efforts.