User Guide

Model MPI Application Performance on GPU

You can model your MPI application performance on a target graphics processing unit (GPU) device to determine whether you can get a performance speedup from offloading the application to the GPU.
The Offload Modeling perspective of the Intel® Advisor includes the following stages:
  1. Collecting the baseline performance data on a host device with the Survey, Characterization (Trip Counts, FLOP), and/or Dependencies analyses. You can collect data for one or more MPI ranks, where each rank corresponds to an MPI process.
  2. Modeling application performance on a target device with the Performance Modeling analysis. You can model performance for only one rank at a time. You can run the Performance Modeling analysis several times for different analyzed ranks to examine potential performance differences between them, but this topic does not cover that case.

Model Performance of MPI Application

Prerequisite: Set up environment variables to enable the Intel Advisor CLI.
In the commands below:
  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.
    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.
  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for the default Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).
This example shows how to run Offload Modeling to model performance for rank 1 of the MPI application. It uses the gtool option of the Intel MPI Library to collect performance data on a baseline CPU. For other collection options, see Analyze MPI Applications.
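The gtool option used in this topic follows the general pattern sketched below, where the rank set after the colon selects which MPI ranks Intel Advisor profiles while all launched ranks still run the application. The placeholders <advisor command line>, <rank set>, <N>, and <app> are not literal; see Analyze MPI Applications and the Intel MPI Library documentation for the exact syntax your launcher supports:
mpirun -gtool "<advisor command line>:<rank set>" -n <N> <app>
The rank set can typically be a single rank (for example, 1, as used below), a comma-separated list, or a range.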
  1. Optional, but recommended: Generate preconfigured command lines for your application using the --dry-run option. For example, generate the command lines using the Intel Advisor CLI:
    advisor --collect=offload --dry-run --project-dir=./advi_results -- ./mpi_sample
    After you run it, a list of analysis commands to run Offload Modeling for the specified accuracy level is printed to the terminal/command prompt. For the command above, the commands are printed for the default medium accuracy:
    advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=projection --no-assume-dependencies --config=xehpg_512xve --project-dir=./advi_results
    You need to modify the printed commands to use an MPI launcher with the MPI-specific syntax. See Analyze MPI Applications for syntax details.
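    If you do not want to use gtool, one commonly used alternative (a sketch assuming the same four-process setup as above; verify the exact syntax in Analyze MPI Applications) is to prefix the Intel Advisor command with the MPI launcher so that every rank is profiled into its own rank.<n> subdirectory of the project:
    mpirun -n 4 advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample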
  2. Collect survey data for rank 1 into the shared ./advi_results project directory.
    mpirun -gtool "advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
  3. Collect trip counts and FLOP data for rank 1.
    mpirun -gtool "advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results:1" -n 4 ./mpi_sample
  4. If you did not collect data to a shared location and need to copy the data to the local system to view the results, do it now.
  5. Model performance for rank 1 of the MPI application, which you ran the analyses for.
    advisor --collect=projection --config=xehpg_512xve --mpi-rank=1 --project-dir=./advi_results
    You can model performance for only one rank at a time. The results for the specified rank are generated in the corresponding ./advi_results/rank.1 directory.
  6. If you did not collect data to a shared location and need to copy the data to the local system to view the results, do it now.
  7. On a local system, view the results with your preferred method.
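    For example, to open the rank 1 result in the Intel Advisor GUI (other methods are described in View Results below):
    advisor-gui ./advi_results/rank.1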

Configure Performance Modeling for MPI Application

By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, you can apply additional configuration and settings to adjust the performance model for specific hardware or application behavior. You can adjust the number of MPI ranks to run per GPU tile and/or exclude MPI time from the report.
In the commands below:
  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.
    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.
  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).

Change the Number of MPI Processes per GPU Tile

Prerequisite: Set up environment variables to enable the Intel Advisor CLI.
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.
By default, Offload Modeling assumes that one MPI process, or rank, is mapped to one GPU tile. You can configure the performance model to adjust the number of MPI ranks to run per GPU tile to match your target device configuration.
To do this, set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or a TOML configuration file. If you want to model performance for the Intel® Arc™ graphics code-named Alchemist, which is the XeHPG 256 or XeHPG 512 configuration in Offload Modeling targets, use the Stack_per_process parameter. The parameter sets the fraction of a GPU tile that runs a single MPI process. For example, if you want to offload an MPI application with 8 processes to a target GPU device with 4 tiles, you need to adjust the performance model to run 2 MPI processes per tile, that is, to use 0.5 tile per process.
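For example, the 8-process, 4-tile scenario above corresponds to a value of 0.5. Assuming the same project directory and executable used elsewhere in this topic, you would pass the value the same way as in the steps below, which use 0.25:
advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.5 --dry-run -- ./mpi_sample
advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.5 --mpi-rank=1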
The number of tiles per process you set automatically adjusts:
  • the number of execution units (EU)
  • SLM, L1, L3 sizes and bandwidth
  • memory bandwidth
  • PCIe* bandwidth
The parameter accepts values from 0.01 to 12.0. Consider the following value examples:
Tiles_per_process / Stack_per_process Value    Number of MPI Ranks per Tile
1.0 (default)                                  1
12.0 (maximum)                                 1/12
0.25                                           4
0.125                                          8
To run Offload Modeling with a custom tile-per-process parameter, you need to scale the parameter during the analysis. The change is applied one time, only to the analysis you run it with. The commands below use the Tiles_per_process parameter for scaling. Replace it with Stack_per_process if needed.
  1. Generate pre-configured command lines for your application with the --set-parameter option to change the number of tiles per process. Use the --dry-run option of the collect.py script to generate commands with the cache configuration adjusted to the scaled parameter.
    For example, to generate commands for the ./advi_results project and model performance with 0.25 tiles per process, which corresponds to four MPI ranks per tile:
    advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.25 --dry-run -- ./mpi_sample
    After you run it, a list of analysis commands to run Offload Modeling with the specified accuracy level is printed to the terminal/command prompt, similar to the following:
    advisor --collect=survey --project-dir=./advi_results --static-instruction-mix -- ./mpi_sample
    advisor --collect=tripcounts --project-dir=./advi_results --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m -- ./mpi_sample
    python $APM/collect.py ./advi_results -m generic
    advisor --collect=dependencies --project-dir=./advi_results --filter-reductions --loop-call-count-limit=16 --ignore-checksums -- ./mpi_sample
  2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax. You need to add the following:
    • MPI launcher name and (optionally) the gtool option for the Intel® MPI Library
    • Number of MPI processes to launch
    • If you use gtool: MPI ranks to analyze
    See Analyze MPI Applications for syntax details.
    You can skip the mark-up and Dependencies analysis steps (the last two printed commands) because they add high overhead; if you skip them, see the alternative Performance Modeling command at the end of step 4. See Check How Assumed Dependencies Affect Modeling for details.
  3. Run the modified commands for the Survey, Trip Counts, and (optionally) Dependencies analyses one by one. For example, to run Survey and Trip Counts for rank 1:
    mpirun -gtool "advisor --collect=survey --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
    mpirun -gtool "advisor --collect=tripcounts --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m --project-dir=./advi_results:1" -n 4 ./mpi_sample
  4. Run the Performance Modeling with the number of tiles per MPI process specified using the --set-parameter option. For example, to model performance for rank 1:
    advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=1
    Make sure to specify the same value for --set-parameter scale.Tiles_per_process as in the dry-run step.
    The result for the specified rank is generated in the corresponding ./advi_results/rank.1 directory. You can transfer it to the development system, if needed, and view the results.
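    If you skipped the mark-up and Dependencies commands, one possible variant of this step (a sketch that reuses options shown earlier in this topic) is to run the Performance Modeling with the assumption that the unanalyzed loops do not have dependencies:
    advisor --collect=projection --project-dir=./advi_results --no-assume-dependencies --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=1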
When you open the result in the Intel Advisor GUI or an interactive HTML report, you should see the tiles per process or stack per process parameter in the Modeling Parameters pane with the value you set. The parameter is read-only. Note that the tiles per process or stack per process parameter shows the value per process, while other parameters in the pane show values per device.

Ignore MPI Time

Prerequisite: Set up environment variables to enable the Intel Advisor CLI.
For multi-rank MPI workloads, the time spent in the MPI runtime can differ from rank to rank, which may cause a significant performance imbalance. Because of this, the whole application time and the Offload Modeling results may differ from rank to rank. If the MPI time is large, differs between ranks, and the MPI code does not include many computations, you can exclude the time spent in MPI routines from the analysis so that it does not affect the modeling results.
  1. Collect Survey, Trip Counts, and (optionally) Dependencies data for your application. See Analyze MPI Applications for details.
  2. Run the Performance Modeling with the time in MPI calls ignored using the --ignore=MPI option.
    advisor --collect=projection --project-dir=./advi_results --ignore=MPI --mpi-rank=1
    The results are generated in the ./advi_results/rank.1 directory. You can transfer them to the development system and view the results.
In the generated report, all per-application performance modeling metrics are calculated based on the application self-time, with the time spent in MPI calls excluded from the analysis. This should make modeling results more consistent across ranks.
This option affects only the metrics for the whole program in the Summary tab. Metrics for individual regions are not recalculated.

View Results

Intel Advisor saves collection results into subdirectories for each rank analyzed under the project directory specified with --project-dir. The modeling results are available only for the ranks that you ran the Performance Modeling for, for example, as specified with the --mpi-rank option.
To view the performance or dependency results collected for a specific rank, you can do one of the following.
View Results in GUI
From the Intel Advisor GUI, open a result project file *.advixeproj that resides in the <project-dir>/rank.<n> directory.
You can also open the GUI from command line:
advisor-gui ./advi_results/rank.1
If you used --no-auto-finalize when collecting data, make sure to set paths to application binaries and sources before viewing the result so that Intel Advisor can finalize it properly.
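One possible way to provide these paths from the command line is sketched below; it assumes that your Intel Advisor version supports the --search-dir and --report options with MPI results, and <path-to-sources> and <path-to-binaries> are placeholders (check the Intel Advisor CLI reference for the exact syntax):
advisor --report=survey --project-dir=./advi_results --mpi-rank=1 --search-dir src:r=<path-to-sources> --search-dir bin:r=<path-to-binaries>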
View Results in Command Line
After you run the Performance Modeling analysis, the summary result of the modeling is printed to a terminal/command prompt. Examine the data to learn the estimated speedup and top five offloaded regions.
View Results in an Interactive HTML Report
Open an interactive advisor-report HTML report generated in the respective rank directory at <project-dir>/rank.<n>/e<NNN>/report and a set of CSV reports in the respective rank directory at <project-dir>/rank.<n>/p<NNN>/data.0.
