Examine Regions Recommended for Offloading
- No cost (tax) is assumed for transferring data between the baseline and target platforms.
- All data is assumed to go to the L1/L3 cache levels only, so the L1/L3 cache traffic estimation might be inaccurate.
- A loop is considered parallel if its dependency type is unknown (the Assume Dependencies checkbox is disabled). This happens when the compiler provides no information about the loop dependency type and the loop is not explicitly marked as parallel, for example, with a programming model (OpenMP*, Data Parallel C++, Intel® oneAPI Threading Building Blocks (oneTBB)).
- Review the metrics for the whole application in the Summary tab.
- Check whether your application is profitable to offload to a target device or performs better on the baseline platform using the Program Metrics panes.
- See what prevents your code from achieving better performance if executed on a target device in the Offload Bounded by pane. If you enable the Assume Dependencies option for the Performance Modeling analysis, you might see a high percentage of dependency-bound code regions. In this case, run the Dependencies analysis and rerun Performance Modeling to get more accurate results.
- If the estimated speed-up is high enough and other metrics in the Summary pane suggest that your application can benefit from offloading to the selected target platform, you can start offloading your code.
- If you want to investigate the results reported for each region in more detail, go to the Accelerated Regions tab and select a code region:
- Check whether your target code region is recommended for offloading to the selected platform. In the Basic Estimated Metrics column group, review the Offload Summary column. A code region is considered profitable for offloading if its estimated speed-up is more than 1, that is, the estimated execution time on the target device is smaller than on the host platform. If your code region of interest is not recommended for offloading, consider re-running the perspective with a higher accuracy or refer to Investigate Not Offloaded Code Regions for recommendations on how to model offloading for this code region.
- Examine the Bounded By column of the Estimated Bounded By group to identify the main bottleneck that limits the performance of the code region. See Bounded By for details about metric interpretation.
- In the Throughput column of the Estimated Bounded By group, review the time spent in the compute- and L3 cache bandwidth-bound parts of your code. If the value is high, consider optimizing compute and/or L3 cache usage in your application.
- Review the metrics in the Compute Estimates column group to see details about the instructions and the number of threads used in each code region.
- Get guidance for offloading your code to a target device and optimizing it so that your code benefits the most in the Recommendations tab. If the code region has room for optimization or underutilizes the capacity of the target device, Intel Advisor provides hints and code snippets that can help you improve your code further.
- View the offload summary and details for the selected code region in the Details pane.
- If the estimated speedup is sufficient and the application is ready to be offloaded, rewrite your code to offload profitable code regions to the target platform and measure the performance of GPU kernels with the GPU Roofline Insights perspective.
- Consider running the Offload Modeling perspective with a higher accuracy level to get a more detailed report.