MPI Code Analysis
- Application Performance Snapshot provides a quick MPI application performance overview.
- Intel Trace Analyzer and Collector explores message passing interface (MPI) usage efficiency with communication hotspots, synchronization bottlenecks, load balancing, etc.
- Intel VTune Profiler focuses on intra-node performance with threading, memory, and vectorization efficiency metrics.
Configure Installation for MPI Analysis on Linux* Host
Configure MPI Analysis with the VTune Profiler
- <n> is the number of MPI processes to be run.
- The -l option of the mpiexec/mpirun tools marks stdout lines with an MPI rank. This option is recommended, but not required.
- The -quiet / -q option suppresses diagnostic output such as progress messages. This option is recommended, but not required.
- -collect <analysis type> specifies the analysis type you run with the VTune Profiler. To view a list of available analysis types, use the vtune -help collect command.
- -trace-mpi adds a per-node suffix to the result directory name and adds a rank number to a process name in the result. This option is required for non-Intel MPI launchers.
- -result-dir <my_result> specifies the path to a directory in which the analysis results are stored.
mpirun -n 16 -ppn 4 -l vtune -collect hotspots -k sampling-mode=hw -trace-mpi -result-dir my_result -- my_app.a
my_result.host_name1 (rank 0-3)
my_result.host_name2 (rank 4-7)
my_result.host_name3 (rank 8-11)
my_result.host_name4 (rank 12-15)
export VTUNE_CL="vtune -collect memory-access -trace-mpi -result-dir my_result"
mpirun -host myhost1 -n 7 my_app.a : -host myhost1 -n 1 $VTUNE_CL -- my_app.a : -host myhost2 -n 7 my_app.a : -host myhost2 -n 1 $VTUNE_CL -- my_app.a
# config.txt configuration file
-host myhost1 -n 7 ./a.out
-host myhost1 -n 1 vtune -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out
-host myhost2 -n 7 ./a.out
-host myhost2 -n 1 vtune -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out
mpirun -configfile ./config.txt
mpirun -gtool "vtune -collect memory-access -result-dir my_result:7,5" my_app.a
- This example runs the HPC Performance Characterization analysis type (based on the sampling driver), which is recommended as a starting point:
mpirun -n 4 vtune -result-dir my_result -collect hpc-performance -- my_app [my_app_options]
- This example collects Hotspots data (hardware event-based sampling mode) for two out of 16 processes run on myhost2 in a job distributed across the hosts:
mpirun -host myhost1 -n 8 ./a.out : -host myhost2 -n 6 ./a.out : -host myhost2 -n 2 vtune -result-dir foo -c hotspots -k sampling-mode=hw ./a.out
As a result, the VTune Profiler creates a result directory foo.myhost2 in the current directory (given that process ranks 14 and 15 were assigned to the second node in the job).
- As an alternative to the previous example, you can create a configuration file with the following content:
# config.txt configuration file
-host myhost1 -n 8 ./a.out
-host myhost2 -n 6 ./a.out
-host myhost2 -n 2 vtune -quiet -collect hotspots -k sampling-mode=hw -result-dir foo ./a.out
and run the data collection as:
mpirun -configfile ./config.txt
to achieve the same result as in the previous example: the foo.myhost2 result directory is created.
- This example runs the Memory Access analysis with memory object profiling for all ranks on all nodes:
mpirun -n 16 -ppn 4 vtune -r my_result -collect memory-access -knob analyze-mem-objects=true -- my_app [my_app_options]
- This example runs the Hotspots analysis (hardware event-based sampling mode) on ranks 1, 4-6, and 10:
mpirun -gtool "vtune -r my_result -collect hotspots -k sampling-mode=hw : 1,4-6,10" -n 16 -ppn 4 my_app [my_app_options]
Control Collection with Standard MPI_Pcontrol Function
- Pause data collection: MPI_Pcontrol(0)
- Resume data collection: MPI_Pcontrol(1)
- Exclude the initialization phase: use the VTune Profiler -start-paused option and add an MPI_Pcontrol(1) call right after the initialization code completes. Unlike ITT API calls, using the MPI_Pcontrol function to control data collection does not require linking the profiled application with a static ITT API library, and therefore requires no changes in the build configuration of the application.
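The pattern above can be sketched as a minimal MPI program. This is an illustrative example, not code from the VTune Profiler documentation: the file name and the workload are hypothetical, and it assumes an MPI installation (compile with mpicc, launch under mpirun with vtune -start-paused as shown in the comments).

```c
/* pcontrol_example.c -- sketch: exclude the initialization phase from a
 * collection started with the VTune Profiler -start-paused option.
 *
 * Build: mpicc pcontrol_example.c -o pcontrol_example
 * Run:   mpirun -n 4 vtune -start-paused -collect hotspots -r my_result -- ./pcontrol_example
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* ... expensive initialization you do not want profiled ... */

    MPI_Pcontrol(1);   /* resume collection: profiling starts here */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double local = rank + 1.0, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Pcontrol(0);   /* pause collection: finalization is excluded */

    if (rank == 0)
        printf("sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}
```

Because collection starts paused, the time spent before the MPI_Pcontrol(1) call never appears in the result, with no ITT API linkage required.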
Resolve Symbols for MPI Modules
mpirun -np 128 vtune -q -collect hotspots -search-dir /home/foo/syms ./a.out
View Collected Data
- The Only user functions call stack mode attributes the time spent in MPI calls to the user function foo() so that you can see which of your functions you can change to actually improve the performance.
- The default User functions + 1 mode attributes the time spent in the MPI implementation to the top-level system function, MPI_Bar(), so that you can easily see outstandingly heavy MPI calls.
- The User/system functions mode shows the call tree without any re-attribution, so that you can see where exactly in the MPI library the time was spent.
vtune -R hotspots -group-by process,function -r result_dir.host1
vtune -R hotspots -group-by process,module -r result_dir.host2
MPI Implementations Support
- Linux* only: Based on the PMI_RANK or PMI_ID environment variable (whichever is set), the VTune Profiler extends a process name with the captured rank number, which is helpful to differentiate ranks in a VTune Profiler result with multiple ranks. The process naming schema in this case is <process_name> (rank <N>). To enable detecting an MPI rank ID for MPI implementations that do not provide the environment variable, use the -trace-mpi option.
- For the Intel MPI Library, the VTune Profiler classifies MPI functions/modules as system functions/modules (the User functions + 1 option) and attributes their time to system functions. This classification may not work for all modules and functions of non-Intel MPI implementations; in that case, the VTune Profiler may display some internal MPI functions and modules by default.
- You may need to adjust the command line examples in this help section to work for non-Intel MPI implementations. For example, you need to adjust command lines provided for different process ranks to limit the number of processes in the job.
- An MPI implementation needs to operate in cases when the VTune Profiler process (vtune) sits between the launcher process (mpirun/mpiexec) and the application process. This means that communication information should be passed using environment variables, as most MPI implementations do. The VTune Profiler does not work with an MPI implementation that tries to pass communication information from its immediate parent process.
MPI System Modules Recognized by the VTune Profiler
- The VTune Profiler does not support MPI dynamic processes (for example, the MPI_Comm_spawn dynamic process API).