Collect MPI Performance/Correctness Data
To collect performance or correctness data for an MPI application with the Intel® VTune™ Profiler or Intel Inspector on a Windows* or Linux* OS, use the following command:
$ mpirun -n <N> <abbr>-cl -r my_result -collect <analysis type> my_app [my_app_options]
where <abbr> is amplxe or inspxe, respectively. To list the available analysis types, run the amplxe-cl -help collect command.
The collection command creates one result directory per analyzed process in the current directory, for example my_result.0 through my_result.3 for a four-process job. The numeric suffix is the MPI rank of the process, which the collector detects and captures automatically. The suffix ensures that multiple amplxe-cl / inspxe-cl instances launched in the same directory on different nodes do not overwrite each other's data and can work in parallel.
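The naming scheme above can be sketched as a small shell helper that prints the directory names to expect after a run. This is only an illustration: the function name expected_result_dirs is hypothetical and not part of the tools, and the sketch assumes that every rank in the job is profiled.

```shell
# Print the result directory names expected for a job of N ranks.
# Assumes the <prefix>.<rank> naming described above and that all
# ranks are profiled; the function name is illustrative only.
expected_result_dirs() {
    prefix=$1
    nranks=$2
    rank=0
    while [ "$rank" -lt "$nranks" ]; do
        printf '%s.%d\n' "$prefix" "$rank"
        rank=$((rank + 1))
    done
}

# Example: a four-process job started with "-r my_result"
expected_result_dirs my_result 4
```

Comparing this list with the directories actually present after a run is a quick way to spot a rank whose collection did not start.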
Sometimes it is necessary to collect data for only a subset of the MPI processes in the workload. In this case, use the per-host syntax of mpirun/mpiexec* to specify different command lines for different processes.
When launching the collection on Windows OS, we recommend passing the -genvall option to the mpiexec tool so that the user environment variables are passed to all instances of the profiled process. Otherwise, the processes are launched by default in the context of a system account, and some environment variables (USERPROFILE, APPDATA) do not point to the locations the tools expect.
There are also some specifics of stdout / stdin behavior in MPI jobs profiled with the tools:
Pass the -quiet / -q option to amplxe-cl / inspxe-cl to keep every tool process in the job from printing diagnostic output, such as progress messages, to the console.
Consider using the -l option of mpiexec/mpirun to mark each stdout line with the MPI rank that produced it.
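The two output options above can be combined in one command line. The following is a minimal sketch that only prints the command instead of running it, so it can be inspected first. The helper name build_collect_cmd and its argument order are assumptions made for this illustration; the options themselves are the ones described above.

```shell
# Print a collection command line that uses the recommended output options.
# build_collect_cmd and its argument order are hypothetical; nothing is executed.
build_collect_cmd() {
    nranks=$1
    result=$2
    analysis=$3
    shift 3
    # -l     : mpirun marks each stdout line with the MPI rank
    # -quiet : amplxe-cl suppresses progress messages
    echo "mpirun -l -n $nranks amplxe-cl -quiet -r $result -collect $analysis -- $*"
}

# Example: hotspots collection for a four-process job
build_collect_cmd 4 my_result hotspots my_app
```

On Windows OS, the -genvall option recommended earlier would be added to the mpiexec options in the same way.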
The most reasonable analysis type to start with for the Intel VTune Profiler is hotspots, so a full command line for collection would be:
$ mpirun -n 4 amplxe-cl -r my_result -collect hotspots -- my_app [my_app_options]
A similar command line for the Intel Inspector and its ti1/mi1 analysis types (the lowest overhead threading and memory correctness analysis types respectively) would look like:
$ mpirun -n 4 inspxe-cl -r my_result -collect mi1 -- my_app [my_app_options]
$ mpirun -n 4 inspxe-cl -r my_result -collect ti1 -- my_app [my_app_options]
The following example runs a job of 16 processes and collects hotspots data for only two of them:
$ mpirun -host myhost -n 14 ./a.out : -host myhost -n 2 amplxe-cl -r foo -c hotspots ./a.out
As a result, two directories will be created in the current directory: foo.14 and foo.15 (given that process ranks 14 and 15 were assigned to the last 2 processes in the job). As an alternative to specifying the command line above, it is possible to create a configuration file with the following content:
# config.txt configuration file
-host myhost -n 14 ./a.out
-host myhost -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out
and run the data collection as:
$ mpirun -configfile ./config.txt
to achieve the same result as above (foo.14 and foo.15 result directories will be created). Similarly, you can use specific host names to control where the analyzed processes are executed:
# config.txt configuration file
-host myhost1 -n 14 ./a.out
-host myhost2 -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out
When host names are specified, consecutive MPI ranks are allocated to the specified hosts. In the case above, ranks 0 to 13, inclusive, are assigned to myhost1, and the remaining ranks 14 and 15 are assigned to myhost2. On Linux, you can omit the exact hosts, in which case the processes are distributed between the hosts in round-robin fashion. That is, myhost1 will get MPI ranks 0, 2, and 4 through 15, while myhost2 will get MPI ranks 1 and 3. The latter behavior may change in the future.
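The consecutive allocation described above, which applies when hosts are named explicitly, can be worked out with a short shell sketch. The rank_ranges function and its host:count argument format are hypothetical and chosen only for this illustration; the sketch assumes each count is at least 1 and does not model the round-robin case.

```shell
# Print the consecutive rank range assigned to each config-file line.
# Arguments are "host:count" pairs in config-file order (a hypothetical
# input format for this sketch). Each count is assumed to be >= 1.
rank_ranges() {
    first=0
    for spec in "$@"; do
        host=${spec%%:*}      # text before the colon
        count=${spec##*:}     # text after the colon
        last=$((first + count - 1))
        printf '%s: ranks %d-%d\n' "$host" "$first" "$last"
        first=$((last + 1))
    done
}

# Example: the two-line config.txt shown above
rank_ranges myhost1:14 myhost2:2
```

For the configuration file above, the second line reports ranks 14-15, which matches the foo.14 and foo.15 result directories.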
The examples in this reference use the mpirun command rather than mpiexec or mpiexec.hydra, although real-world jobs might use the mpiexec* commands. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and the options passed. All the examples listed here work with the mpiexec* commands as well as with mpirun.