User Guide

Model MPI Application Performance on GPU

You can model your MPI application performance on an accelerator to determine whether it can benefit from offloading to a target device.
For MPI applications, you can collect data only with the advisor command line interface (CLI).
  1. Optional: Generate pre-configured command lines for your application:
    1. Run the Offload collection in a dry-run mode to generate the command lines:
      advisor --collect=offload --dry-run --project-dir=<project-dir> -- ./myApplication [<application-options>]
      It prints a list of commands for each analysis step necessary to get the Offload Modeling result with the specified accuracy level (for the command above, it is low).
      For dry-run accuracy levels and other ways to generate the commands, see Optional: Generate Pre-configured Command Lines.
    2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax. See Analyze MPI Applications for details about the syntax.
  2. Run the Intel Advisor analyses to collect metrics for your application running on a host device with the advisor CLI. For example, the full collection workflow with the Intel® MPI Library gtool option of mpiexec (a filled-in example follows this procedure):
    mpiexec -gtool "advisor --collect=survey --static-instruction-mix --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]
    mpiexec -gtool "advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=<target-gpu> --stacks --data-transfer=light --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]
    mpiexec -gtool "advisor --collect=dependencies --select markup=gpu_generic --loop-call-count-limit=16 --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]
    where:
    • <ranks-set> is the set of MPI ranks to run the analysis for. Separate ranks with a comma, or use a dash "-" to set a range of ranks. Use all to analyze all the ranks.
    • <N> is the number of MPI processes to launch.
    • <target-gpu> is a GPU configuration to model cache for. See --target-device for available configurations.
  3. Model performance of your application on a target device for a single rank with one of the following:
    • With the advisor CLI:
      advisor --collect=projection --mpi-rank=<n> --config=<target-gpu> --project-dir=<project-dir>
    • With the analyze.py script:
      advisor-python <APM>/analyze.py <project-dir> --mpi-rank <n> --config <target-gpu>
    where:
    • <APM> is an environment variable for the path to the Offload Modeling scripts. For Linux* OS, replace it with $APM; for Windows* OS, replace it with %APM%.
    • <n> is the rank number to model performance for.
      Instead of --mpi-rank=<n>, you can specify the path to a rank folder in the project directory. This is only supported by the analyze.py script:
      advisor-python <APM>/analyze.py <project-dir>/rank.<n> [--options]
    • <target-gpu> is a GPU configuration to model the application performance for. Make sure to specify the same target as for the Trip Counts step. See --config for available configurations.
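For illustration, the full workflow with example values substituted might look like the following. All values here are placeholders: a hypothetical four-rank run that analyzes all ranks, the project directory ./advi_results, the gen12_tgl target, and modeling for rank 0. Check --target-device and --config for the configurations available in your version.
    mpiexec -gtool "advisor --collect=survey --static-instruction-mix --project-dir=./advi_results:all" -n 4 ./myApplication
    mpiexec -gtool "advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=gen12_tgl --stacks --data-transfer=light --project-dir=./advi_results:all" -n 4 ./myApplication
    mpiexec -gtool "advisor --collect=dependencies --select markup=gpu_generic --loop-call-count-limit=16 --project-dir=./advi_results:all" -n 4 ./myApplication
    advisor --collect=projection --mpi-rank=0 --config=gen12_tgl --project-dir=./advi_results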

View the Results

For Offload Modeling, the reports are generated automatically after you run performance modeling. You can either open a result project file (*.advixeproj) located in the <project-dir> with the Intel Advisor GUI or view an HTML/CSV report in the respective rank directory at <project-dir>/rank.<n>/e<NNN>/pp<NNN>/data.0.
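For example, assuming the project directory is ./advi_results and the Intel Advisor GUI launcher advisor-gui is available in your environment, you can open the result with:
    advisor-gui ./advi_results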

Model Performance for Multi-Rank MPI

By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, do one of the following:
Scale Target Device Parameters
By default, Offload Modeling assumes that one MPI process is mapped to one GPU tile. You can configure the performance model and map MPI ranks to a target device configuration. To do this, you need to set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or a TOML configuration file. The parameter sets a fraction of a GPU tile that corresponds to a single MPI process and accepts values from 0.01 to 12.0.
The number of tiles per process you set automatically adjusts:
  • the number of execution units (EU)
  • SLM, L1, L3 sizes and bandwidth
  • memory bandwidth
  • PCIe* bandwidth
Consider the following value examples:

    Tiles_per_process Value    Number of MPI Ranks per Tile
    1 (default)                1
    12 (maximum)               1/12
    0.25                       4
    0.125                      8
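For example, Tiles_per_process = 0.25 allocates a quarter of a GPU tile to each MPI process, which corresponds to 1 / 0.25 = 4 ranks sharing one tile, and the modeled EU count, cache sizes, and bandwidths are adjusted accordingly.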
Info: In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
To run Offload Modeling with a scaled tiles-per-process parameter:
Method 1. Scale the parameter during the analysis. This is a one-time change applied only to the analysis you run it with.
  1. Run the script in dry-run mode to generate command lines with the cache configuration adjusted to the specified number of tiles per process. For example, to generate commands for the ./advi_results project and model performance with 0.25 tiles per process, which corresponds to four MPI ranks per tile:
    advisor-python $APM/collect.py ./advi_results --dry-run --set-parameter scale.Tiles_per_process=0.25 -- ./myApplication
    You can specify any value from 0.01 to 12.0 for the scale.Tiles_per_process parameter.
    This command generates a set of command lines for the Offload Modeling workflow that runs the collection with the advisor CLI with parameters adjusted for the configuration.
  2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax. See the list above for command templates.
  3. Optional: If you have not collected performance data for your application, run the Survey analysis using the generated and modified Survey command.
  4. From your text editor, copy the modified command for the Trip Counts analysis and run it from the shell. For example, the command from the previous step should look as follows if run for the Intel® MPI Library:
    mpiexec -gtool "advisor --collect=tripcounts --project-dir=./advi_results --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:1w:4k/1:192w:3m/1:16w:8m" -n 4 ./myApplication
    This command adjusts metrics for the new cache configuration.
  5. Run the performance modeling for one MPI rank with the number of tiles per MPI process specified. For example, with the advisor CLI for the MPI rank 4:
    advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=4
    Make sure to specify the same value for --set-parameter scale.Tiles_per_process as for the Trip Counts step.
    The report for the specified MPI rank will be generated in the project directory. Proceed to view the results.
Method 2. Create a custom configuration file to use with any device configuration.
  1. Scale the parameter with one of the following:
    • Create a TOML file, for example, my_config.toml, and specify the parameter as follows:
      [scale]
      Tiles_per_process = <float>
      where <float> is a fraction of a GPU tile that corresponds to a single MPI process. An example file is shown after this procedure.
    • Use scalers in the legacy Offload Modeling HTML report:
      1. Run the Performance Modeling for your application without scaling.
      2. Go to <project-dir>/rank.<N>/e<NNN>/report/ and open the legacy Offload Modeling HTML report report.html.
      3. In the Summary tab, set the MPI Tile per Process scaler in the configuration pane to a desired value.
      4. Click the Download configuration file icon to save the current configuration as scalers.toml.
  2. Re-run the performance modeling with the custom TOML file. For example, with my_config.toml:
    advisor --collect=projection --config=gen12_tgl --custom-config=./my_config.toml --mpi-rank=4 --project-dir=./advi_results
The report for the specified MPI rank will be generated in the project directory. Proceed to view the results.
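For reference, a minimal my_config.toml that maps four MPI ranks to one tile could contain only the following (0.25 is an example value; use the fraction that matches your setup):
    [scale]
    Tiles_per_process = 0.25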
Ignore MPI Time
For multi-rank MPI workloads, time spent in the MPI runtime can differ from rank to rank and cause differences in the whole application time and Offload Modeling projections. If MPI time is significant and you see differences between ranks, you can exclude time spent in MPI routines from the analysis.
  1. Collect performance data for your application using the advisor CLI.
  2. Run the performance modeling with time in MPI calls ignored using the --ignore option. For example, with the advisor CLI:
    advisor --collect=projection --project-dir=./advi_results --ignore=MPI --mpi-rank=4
In the generated report, all per-application performance modeling metrics are recalculated based on the application self time, excluding time spent in MPI calls from the analysis. This should improve modeling across ranks.
This option affects only metrics for the whole program in the Summary tab. Metrics for individual regions are not recalculated.
