Tutorial: Analyzing OpenMP* and MPI Applications

ID 773235
Date 5/20/2020

Summary

You have completed the Analyzing OpenMP* and MPI Applications tutorial with Application Performance Snapshot, Intel® Trace Analyzer and Collector, and Intel® VTune™ Profiler. Here are some important things to remember when working with your own hybrid application:

For each step below, the first paragraph recaps what you did in the tutorial, and the text that follows lists the key take-aways.

1. Build and configure application

You verified that all of the relevant tools were installed, built the application, and ran it with various MPI process/OpenMP thread combinations to identify optimization opportunities.

Test various combinations of MPI processes and OpenMP threads for your hybrid application. Different combinations can produce very different performance results for the same application.
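One quick way to plan these tests is to list every way of splitting a node's cores into MPI ranks and OpenMP threads. The Python sketch below rests on two assumptions, and you should adjust both for your own system:

- It targets a single node with a hypothetical 16 cores.
- Each combination uses every core exactly once, so ranks * threads equals the core count.

The sketch only prints candidate launch lines. `./heart_demo ...` is a placeholder for your application and its arguments.

```python
from __future__ import annotations


def hybrid_combinations(cores_per_node: int) -> list[tuple[int, int]]:
    """Return every (mpi_ranks, omp_threads) pair with ranks * threads == cores_per_node."""
    combos = []
    for ranks in range(1, cores_per_node + 1):
        if cores_per_node % ranks == 0:  # keep only splits that use every core
            combos.append((ranks, cores_per_node // ranks))
    return combos


if __name__ == "__main__":
    # Hypothetical node size: replace 16 with the core count of your node.
    for ranks, threads in hybrid_combinations(16):
        # OMP_NUM_THREADS sets the OpenMP threads per rank; "./heart_demo ..." is a
        # placeholder for your application and its arguments.
        print(f"OMP_NUM_THREADS={threads} mpirun -n {ranks} ./heart_demo ...")
```

Time each printed configuration separately and compare the elapsed times. Which split is fastest depends on the application.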

2. Get a performance overview with Application Performance Snapshot

You ran the heart_demo application with the -aps option to collect load balance, memory usage, disk usage, and other metrics.

Use the Application Performance Snapshot HTML report to review where your application is inefficient and determine which tool to use next.

3. Identify communication issues with Intel Trace Analyzer and Collector

You ran the application with the -trace option to understand MPI library wait times and communication patterns. You reviewed the results using the Message Profile chart and identified communication issues.

  • An application can be MPI-bound because of high wait times in the MPI library, heavy active communication, or poorly optimized MPI library settings.

  • When evaluating an application with Intel Trace Analyzer and Collector, the source of a problem is not always obvious. Examine the application and all available charts closely to find issues.

  • Use the Message Profile chart to view the intensity of point-to-point communications for each sender-receiver pair.
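Conceptually, a message profile aggregates point-to-point traffic into a sender-by-receiver matrix. The Python sketch below illustrates that aggregation using total bytes as the intensity measure; the tool's chart may be configured to show other measures. The sketch does not read Intel Trace Analyzer and Collector trace files. The `(sender, receiver, nbytes)` record format is a hypothetical stand-in for whatever message data you have.

```python
from __future__ import annotations


def message_profile(messages: list[tuple[int, int, int]], nranks: int) -> list[list[int]]:
    """Sum bytes per (sender, receiver) pair into an nranks x nranks matrix.

    messages holds hypothetical (sender, receiver, nbytes) records;
    matrix[s][r] is the total number of bytes rank s sent to rank r.
    """
    matrix = [[0] * nranks for _ in range(nranks)]
    for sender, receiver, nbytes in messages:
        matrix[sender][receiver] += nbytes
    return matrix


def hottest_pairs(matrix: list[list[int]], top: int = 3) -> list[tuple[tuple[int, int], int]]:
    """Return the `top` sender-receiver pairs with the most traffic, largest first."""
    pairs = [
        ((sender, receiver), nbytes)
        for sender, row in enumerate(matrix)
        for receiver, nbytes in enumerate(row)
        if nbytes > 0
    ]
    pairs.sort(key=lambda item: item[1], reverse=True)
    return pairs[:top]
```

A matrix in which a few cells are far heavier than the rest may point to the sender-receiver pairs worth investigating first.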

4. Tune MPI-bound code

You optimized the application by applying the Cuthill-McKee algorithm for reordering a mesh before performing calculations. You used Intel Trace Analyzer and Collector and Application Performance Snapshot to confirm the performance improvement.

After each optimization, re-test the best-performing MPI process and OpenMP thread combinations to see whether the results have changed. Run the application without any analysis tools to get an accurate elapsed time, because data collection adds overhead.
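The Cuthill-McKee algorithm renumbers the nodes of a sparse graph, such as a mesh, using a breadth-first search that starts from a low-degree node. At each node, it visits the unvisited neighbors in order of increasing degree. This tends to reduce the bandwidth of the adjacency matrix, so connected nodes receive nearby indices.

The Python sketch below is a simplified version, not the tutorial's code. It makes two simplifying assumptions:

- Each connected component starts from a minimum-degree node, rather than from a searched-for pseudo-peripheral node.
- The mesh is given as a plain adjacency dictionary.

```python
from __future__ import annotations

from collections import deque


def cuthill_mckee(adjacency: dict[int, list[int]]) -> list[int]:
    """Return a Cuthill-McKee ordering of an undirected graph's nodes.

    adjacency maps each node to its neighbors; the graph is assumed symmetric
    (if b is in adjacency[a], then a is in adjacency[b]).
    """
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}

    def by_degree(n: int) -> tuple[int, int]:
        return (degree[n], n)  # ties broken by node id for a deterministic order

    visited: set[int] = set()
    order: list[int] = []
    # Each still-unvisited node, taken in ascending degree, starts a new component.
    for start in sorted(adjacency, key=by_degree):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:  # breadth-first search
            node = queue.popleft()
            order.append(node)
            # Enqueue unvisited neighbors in ascending degree order.
            for neighbor in sorted(adjacency[node], key=by_degree):
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
    return order


def bandwidth(adjacency: dict[int, list[int]], order: list[int]) -> int:
    """Largest index distance between connected nodes under the given ordering."""
    position = {node: i for i, node in enumerate(order)}
    return max(
        (abs(position[a] - position[b]) for a in adjacency for b in adjacency[a]),
        default=0,
    )


if __name__ == "__main__":
    # A 5-node path 0-3-1-4-2 whose original numbering scatters neighbors.
    mesh = {0: [3], 3: [0, 1], 1: [3, 4], 4: [1, 2], 2: [4]}
    order = cuthill_mckee(mesh)
    print(order, bandwidth(mesh, list(range(5))), bandwidth(mesh, order))
    # prints: [0, 3, 1, 4, 2] 3 1
```

For real meshes, SciPy provides `scipy.sparse.csgraph.reverse_cuthill_mckee`, which returns the reversed (reverse Cuthill-McKee) ordering. Reversing an ordering leaves its bandwidth unchanged.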

5. Analyze vector instruction set with Intel VTune Profiler

You used Intel VTune Profiler to run a performance analysis of the heart_demo application on the thread suggested by the Application Performance Snapshot report. You then updated the application to use the latest vector instruction set.

Code built for legacy vector instruction sets cannot take advantage of the wider vector registers in newer processors, which leads to inefficient performance. Be sure to target the latest vector instruction set that your hardware supports.
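A rough count shows why the instruction set matters. Assume double-precision (64-bit) elements. One vector instruction then operates on as many elements as fit in a vector register:

```latex
\text{elements per instruction} = \frac{\text{register width (bits)}}{\text{element width (bits)}}
\qquad
\text{SSE: } \frac{128}{64} = 2, \quad
\text{AVX/AVX2: } \frac{256}{64} = 4, \quad
\text{AVX-512: } \frac{512}{64} = 8
```

This is an idealized upper bound on per-instruction throughput. Memory bandwidth, clock frequency, and how well the compiler vectorizes each loop all affect the real speedup.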

6. Analyze serial and parallel code efficiency with Intel VTune Profiler

You used Intel VTune Profiler to review parallelism issues and updated the sample code to fix the problem functions. You then reviewed the process/thread combinations again and observed efficiency improvements.

Review the Bottom-up tab in Intel VTune Profiler to find sections of your application that would benefit from threading and to explore the efficiency of threaded code.
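One simple way to quantify efficiency across process/thread combinations is to compare wall-clock times against a single-process, single-thread baseline. The Python sketch below assumes you have measured those times yourself, without analysis tools attached. It is a generic speedup and efficiency calculation, not a metric reported by Intel VTune Profiler.

```python
from __future__ import annotations


def speedup_and_efficiency(serial_time: float, parallel_time: float, cores: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for a run that used `cores` cores in total.

    speedup    = serial_time / parallel_time
    efficiency = speedup / cores  (1.0 would be perfect scaling)
    """
    if serial_time <= 0 or parallel_time <= 0 or cores <= 0:
        raise ValueError("times and core count must be positive")
    speedup = serial_time / parallel_time
    return speedup, speedup / cores


def rank_combinations(
    serial_time: float, timings: dict[tuple[int, int], float]
) -> list[tuple[tuple[int, int], float]]:
    """Sort (mpi_ranks, omp_threads) combinations from most to least efficient.

    timings maps each combination to its measured elapsed time in seconds.
    """
    results = []
    for (ranks, threads), elapsed in timings.items():
        # Total cores used by a hybrid run is ranks * threads.
        _, efficiency = speedup_and_efficiency(serial_time, elapsed, ranks * threads)
        results.append(((ranks, threads), efficiency))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Efficiency well below 1.0 for a combination suggests time lost to serial sections, communication, or threading overhead. The Bottom-up analysis can help locate the code responsible.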