Intel® Advisor User Guide

ID 766448
Date 10/31/2024

Model MPI Application Performance on GPU

You can model your MPI application performance on a target graphics processing unit (GPU) device to determine whether you can get a performance speedup from offloading the application to the GPU.

The Offload Modeling perspective of Intel® Advisor includes the following stages:

  1. Collecting the baseline performance data on a host device with the Survey, Characterization (Trip Counts, FLOP), and/or Dependencies analyses. You can collect data for one or more MPI ranks, where each rank corresponds to an MPI process.
  2. Modeling application performance on a target device with the Performance Modeling analysis. You can model performance for only one rank at a time. You can run the Performance Modeling analysis several times for different analyzed ranks to examine potential performance differences between them, but this topic does not cover that case.

Model Performance of MPI Application

Prerequisite: Set up environment variables to enable Intel Advisor CLI.

In the commands below:

  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.

    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.

  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for the default Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).

This example shows how to run Offload Modeling to model performance for rank 1 of the MPI application. It uses the gtool option of the Intel® MPI Library to collect performance data on a baseline CPU. For other collection options, see Analyze MPI Applications.

  1. Optional, but recommended: Generate preconfigured command lines for your application using the --dry-run option. For example, generate the command lines using Intel Advisor CLI:
    advisor --collect=offload --dry-run --project-dir=./advi_results -- ./mpi_sample

    After you run it, the list of analysis commands for the Offload Modeling collection at the specified accuracy level is printed to the terminal/command prompt. For the command above, the commands correspond to the default medium accuracy:

    advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results -- ./mpi_sample
    advisor --collect=projection --no-assume-dependencies --config=xehpg_512xve --project-dir=./advi_results

    You need to modify the printed commands to use an MPI launcher with the appropriate MPI syntax. See Analyze MPI Applications for syntax details.
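
    For example, to collect Survey data only for rank 1 without the gtool option, you can use the colon-separated multiple-program syntax of the Intel® MPI Library launcher so that only the selected rank runs under Intel Advisor. This is a sketch of the first printed command adapted to that syntax; verify it against the launcher you use:
    mpirun -n 1 advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample : -n 3 ./mpi_sample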

  2. Collect Survey data for rank 1 into the shared ./advi_results project directory.
    mpirun -gtool "advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
  3. Collect Trip Counts and FLOP data for rank 1.
    mpirun -gtool "advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results:1" -n 4 ./mpi_sample
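
    Optional: If you want the Performance Modeling analysis to verify loop-carried dependencies instead of assuming their absence, you can also run the Dependencies analysis for rank 1. The command below is a sketch that follows the same gtool pattern; the loop-limiting options shown are typical for the Offload Modeling preset, and your --dry-run output may print additional markup options, so prefer the exact command it prints:
    mpirun -gtool "advisor --collect=dependencies --filter-reductions --loop-call-count-limit=16 --project-dir=./advi_results:1" -n 4 ./mpi_sample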
  4. If you did not collect data to a shared location and need to copy the collected data to the system where you will run the Performance Modeling analysis, do it now.
  5. Model performance for rank 1 of the MPI application that you analyzed.
    advisor --collect=projection --config=xehpg_512xve --mpi-rank=1 --project-dir=./advi_results

    You can model performance for only one rank at a time. The results for the specified rank are generated in the corresponding ./advi_results/rank.1 directory.

  6. If you did not collect data to a shared location and need to copy the results to the local system to view them, do it now.
  7. On a local system, view the results with your preferred method.
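    For example, you can open the rank 1 result in the Intel Advisor GUI. This is a sketch, assuming the advisor-gui command is available in your environment; the Performance Modeling step also generates an interactive HTML report in the rank results directory, and its exact location can vary between Intel Advisor versions:
    advisor-gui ./advi_results/rank.1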

Configure Performance Modeling for MPI Application

By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, you can apply additional configuration to adjust the performance model for specific hardware or application needs. You can adjust the number of MPI ranks to run per GPU tile and/or exclude MPI time from the report.

In the commands below:

  • Data is collected remotely to a shared directory.
  • The analyses are performed for an application running in four processes.
  • Path to an application executable is ./mpi_sample.

    Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.

  • Path to an Intel Advisor project directory is ./advi_results.
  • Performance is modeled for Intel® Arc™ graphics code-named Alchemist (xehpg_512xve configuration).

Change the Number of MPI Processes per GPU Tile

Prerequisite: Set up environment variables to enable Intel Advisor CLI.

NOTE:
Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.

By default, Offload Modeling assumes that one MPI process, or rank, is mapped to one GPU tile. You can configure the performance model to adjust the number of MPI ranks to run per GPU tile to match your target device configuration.

To do this, you need to set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or a TOML configuration file. If you model performance for the Intel® Arc™ graphics code-named Alchemist, which is the XeHPG 256 or XeHPG 512 configuration in Offload Modeling targets, use the Stack_per_process parameter instead. The parameter sets the fraction of a GPU tile that runs a single MPI process. For example, if you want to offload your MPI application with 8 processes to a target GPU device with 4 tiles, you need to adjust the performance model to run 2 MPI processes per tile, that is, to use 0.5 tile per process, as shown in the sketch below.
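
A minimal sketch of the Performance Modeling command for the 0.5 tile-per-process example above, assuming the value is passed on the command line with the --set-parameter option and that the XeHPG parameter uses the same scale. prefix as Tiles_per_process (alternatively, set the same parameter in a TOML file and pass it with --custom-config):

  advisor --collect=projection --config=xehpg_512xve --set-parameter scale.Stack_per_process=0.5 --mpi-rank=1 --project-dir=./advi_results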

The number of tiles per process you set automatically adjusts:

  • the number of execution units (EU)
  • SLM, L1, L3 sizes and bandwidth
  • memory bandwidth
  • PCIe* bandwidth

The parameter accepts values from 0.01 to 12.0. Consider the following value examples:

Tiles_per_process/Stack_per_process