Run Offload Modeling Perspective from Command Line

Intel® Advisor provides several methods to run the Offload Modeling perspective from the command line. Use one of the following:
  • Method 1. Run Offload Modeling with command line collection presets. Use this method if you want to use basic Intel Advisor analysis and modeling functionality, especially for a first-run analysis. This simple method allows you to run multiple analyses with a single command and control the modeling accuracy.
  • Method 2. Run Offload Modeling analyses separately. Use this method if you want to analyze an MPI application or need more advanced analysis customization. This method allows you to select what performance data you want to collect for your application and configure each analysis separately.
  • Method 3. Run Offload Modeling with Python* scripts. Use this method if you need more analysis customization. This method is moderately flexible and allows you to customize the data collection and performance modeling steps.

Prerequisites

  1. Set Intel Advisor environment variables with an automated script. The script enables the advisor command line interface (CLI), the advisor-python command line tool, and the APM environment variable, which points to the directory with the Offload Modeling scripts and simplifies their use.
  2. For Data Parallel C++ (DPC++), OpenMP* target, and OpenCL™ applications: Set up environment variables to temporarily offload your application to a CPU for the analysis.
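For example, the following commands set up the environment with the oneAPI setvars script. This is a minimal sketch: the installation paths below assume a default oneAPI layout and are placeholders to adjust for your system.
  • On Linux* OS:
    source /opt/intel/oneapi/setvars.sh
  • On Windows* OS:
    "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
After the script runs, you can verify the setup by checking that the advisor command is on the path (advisor --version) and that the APM variable is set (echo $APM on Linux).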

Optional: Generate Pre-configured Command Lines

With Intel Advisor, you can generate pre-configured command lines for your application and hardware. Use this feature if you want to:
  • Analyze an MPI application
  • Customize pre-set Offload Modeling commands
The Offload Modeling perspective consists of multiple analysis steps executed for the same application and project. You can configure each step from scratch or use pre-configured command lines that do not require you to manually provide the paths to a project directory and an application executable.
Option 1. Generate pre-configured command lines with --collect=offload and the --dry-run option. The option generates:
  • Commands for the Intel Advisor CLI collection workflow.
  • Commands that correspond to a specified accuracy level.
  • Commands not configured to analyze an MPI application. You need to manually adjust the commands for MPI.
Info: In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
The workflow includes the following steps:
  1. Generate the commands using the --dry-run option of --collect=offload. Specify an accuracy level and the paths to your project directory and application executable.
     For example, to generate the low-accuracy commands for the myApplication application executable, run the following command:
     • On Linux* OS:
       advisor --collect=offload --accuracy=low --dry-run --project-dir=./advi_results -- ./myApplication
     • On Windows* OS:
       advisor --collect=offload --accuracy=low --dry-run --project-dir=./advi_results -- myApplication.exe
     This prints a list of commands for each analysis step necessary to get an Offload Modeling result with the specified accuracy level (for the commands above, it is low).
  2. If you analyze an MPI application: Copy the generated commands to your preferred text editor and modify each command to use an MPI tool (see the sketch after this list). For details about the syntax, see Analyze MPI Applications.
  3. Run the generated commands one by one from a command prompt or a terminal.
For details about MPI application analysis with Offload Modeling, see Model MPI Application Performance on GPU.
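For example, a generated Survey command could be adapted to profile an MPI application as follows. This is a minimal sketch assuming the Intel MPI Library mpirun launcher and four ranks; treat the launcher syntax and rank count as placeholders for your environment:
  mpirun -n 4 advisor --collect=survey --static-instruction-mix --project-dir=./advi_results -- ./myApplication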
Option 2. If you have the Intel Advisor graphical user interface (GUI) available on your system and you want to analyze an MPI application from the command line, you can generate the pre-configured command lines from the GUI.
The GUI generates:
  • Commands for the Intel Advisor CLI collection workflow.
  • Commands for a selected accuracy level if you want to run a pre-defined accuracy level, or commands for a custom project configuration if you want to enable/disable additional analysis options.
  • Commands configured for MPI applications with the Intel® MPI Library. You do not need to manually modify the commands for the MPI application syntax.
For detailed instructions, see Generate Command Lines from GUI.

Method 1. Use Collection Presets

For the Offload Modeling perspective, Intel Advisor has a special collection mode --collect=offload that allows you to run the perspective analyses using only one Intel Advisor CLI command. When you run the collection, it sequentially runs the data collection and performance modeling steps. The specific analyses and options depend on the accuracy level you specify for the collection.
Info: In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
For example, to run the Offload Modeling perspective with the default (medium) accuracy level:
  • On Linux* OS:
    advisor --collect=offload --project-dir=./advi_results -- ./myApplication
  • On Windows* OS:
    advisor --collect=offload --project-dir=./advi_results -- myApplication.exe
The collection progress and commands for each analysis executed will be printed to a terminal or a command prompt. When the collection is finished, you will see the result summary.
Analysis Details
To change the analyses to run and their options, you can specify a different accuracy level with the --accuracy=<level> option. The default accuracy level is medium.
The following accuracy levels are available:
  • low accuracy includes Survey, Characterization with Trip Counts and FLOP collections, and Performance Modeling analyses.
  • medium (default) accuracy includes Survey, Characterization with Trip Counts and FLOP collections, cache and data transfer simulation, and Performance Modeling analyses.
  • high accuracy includes Survey, Characterization with Trip Counts and FLOP collections; cache, data transfer, and memory object attribution simulation; Dependencies; and Performance Modeling analyses.
For example, to run the low accuracy level:
advisor --collect=offload --accuracy=low --project-dir=./advi_results -- myApplication.exe
To run the high accuracy level:
advisor --collect=offload --accuracy=high --project-dir=./advi_results -- myApplication.exe
If you want to see the commands that are executed at each accuracy level, you can run the collection with the --dry-run option. The commands will be printed to a terminal or a command prompt.
For details about each accuracy level, see Offload Modeling Accuracy Levels in Command Line.
Customize Collection
You can also specify additional options if you want to run the Offload Modeling with a custom configuration. This collection accepts most options of the Performance Modeling analysis (--collect=projection) and some options of the Survey, Trip Counts, and Dependencies analyses that can be useful for the Offload Modeling.
Specify the additional options after the --accuracy option so that they take precedence over the accuracy level configurations.
Consider the following action options:

--accuracy=<level>
    Set an accuracy level for a collection preset. Available accuracy levels: low, medium (default), high.
--config
    Select a target GPU configuration to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values and mapping to device names.
--gpu
    Analyze a Data Parallel C++ (DPC++), OpenCL™, or OpenMP* target application on a graphics processing unit (GPU) device. This option automatically adds all related options to each analysis included in the preset. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
--data-reuse-analysis
    Analyze potential data reuse between code regions. This option automatically adds all related options to each analysis included in the preset.
--enforce-fallback
    Emulate data distribution over stacks if stacks collection is disabled. This option automatically adds all related options to each analysis included in the preset.

For details about other available options, see collect.
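For example, a sketch of a customized preset run that raises the accuracy level, targets a different GPU configuration, and enables the data reuse analysis (note that the custom options follow --accuracy):
  advisor --collect=offload --accuracy=high --config=gen12_dg1 --data-reuse-analysis --project-dir=./advi_results -- ./myApplication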

Method 2. Use per-Analysis Collection

You can collect data and model performance for your application by running each Offload Modeling analysis in a separate command using the Intel Advisor CLI. This option allows you to:
  • Control what analyses you want to run to profile your application and what data you want to collect.
  • Modify the behavior of each analysis you run with an extensive set of options.
  • Re-model application performance without re-collecting performance data. This can save time if you want to see how the performance of your application might change with different modeling parameters using the same performance data as a baseline.
  • Profile and model performance of MPI applications.
Consider the following workflow example. Using this example, you can run the Survey, Trip Counts, and FLOP analyses to profile an application and the Performance Modeling analysis to model its performance on a selected target device.
Info: In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
On Linux OS:
  1. Run the Survey analysis.
     advisor --collect=survey --static-instruction-mix --project-dir=./advi_results -- ./myApplication
  2. Run the Trip Counts and FLOP analyses with data transfer simulation for Intel® Iris® Xe MAX graphics (gen12_dg1 configuration).
     advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=gen12_dg1 --stacks --data-transfer=light --project-dir=./advi_results -- ./myApplication
  3. Run the Performance Modeling analysis to model application performance on Intel® Iris® Xe MAX graphics.
     advisor --collect=projection --config=gen12_dg1 --project-dir=./advi_results
     You will see the result summary printed to the command prompt.
For more useful options, see the Analysis Details section below.
On Windows OS:
  1. Run the Survey analysis.
     advisor --collect=survey --static-instruction-mix --project-dir=./advi_results -- myApplication.exe
  2. Run the Trip Counts and FLOP analyses with data transfer simulation for Intel® Iris® Xe MAX graphics (gen12_dg1 configuration).
     advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=gen12_dg1 --stacks --data-transfer=light --project-dir=./advi_results -- myApplication.exe
  3. Run the Performance Modeling analysis to model application performance on Intel® Iris® Xe MAX graphics.
     advisor --collect=projection --config=gen12_dg1 --project-dir=./advi_results
     You will see the result summary printed to the command prompt.
For more useful options, see the Analysis Details section below.
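Because the Performance Modeling step reads performance data already stored in the project, you can re-model for another target without re-running the Survey and Trip Counts analyses. For example, a sketch that re-models the collected data for the gen9_gt3 configuration (the cache simulation above was collected for gen12_dg1, so estimates for other targets may be less accurate):
  advisor --collect=projection --config=gen9_gt3 --project-dir=./advi_results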
Analysis Details
The Offload Modeling workflow includes the following analyses:
  1. Survey to collect initial performance data.
  2. Characterization with trip counts and FLOP to collect performance details.
  3. Dependencies (optional) to identify loop-carried dependencies that might limit offloading.
  4. Performance Modeling to model performance on a selected target device.
Each analysis has a set of additional options that modify its behavior and collect additional performance data. The more analyses you run and options you use, the higher the modeling accuracy.
Consider the following options:
Survey Options
To run the Survey analysis, use the following command line action: --collect=survey.
Recommended action options:

--static-instruction-mix
    Collect static instruction mix data. This option is recommended for the Offload Modeling perspective.
--profile-gpu
    Analyze a DPC++, OpenCL, or OpenMP target application on a GPU device. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
Characterization Options
To run the Characterization analysis, use the following command line action: --collect=tripcounts.
Recommended action options:

--flop
    Collect data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms.
--stacks
    Enable advanced collection of call stack data.
--enable-cache-simulation
    Enable modeling cache behavior for a target device. Make sure to use this option with the --target-device=<target> option.
--target-device=<target>
    Specify a target graphics processing unit (GPU) to model cache for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See target-device for a full list of possible values and mapping to device names. Use with the --enable-cache-simulation option. Make sure to specify the same target device as for --collect=projection --config=<config>.
--data-transfer=<mode>
    Enable modeling data transfers between host and target devices. The following modes are available:
      • Use off (default) to disable data transfer modeling.
      • Use light to model only data transfers.
      • Use medium to model data transfers, attribute memory objects, and track accesses to stack memory.
      • Use full to model data transfers, attribute memory objects, track accesses to stack memory, and enable the data reuse analysis.
--profile-gpu
    Analyze a DPC++, OpenCL, or OpenMP target application on a GPU device. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
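For example, a sketch of a Characterization run that upgrades the data transfer modeling to the medium mode, so memory objects are attributed and stack accesses are tracked:
  advisor --collect=tripcounts --flop --stacks --enable-cache-simulation --target-device=gen12_dg1 --data-transfer=medium --project-dir=./advi_results -- ./myApplication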
Dependencies Options
The Dependencies analysis is optional because it adds a high overhead and is mostly necessary if you have scalar loops/functions in your application. For details about when you need to run the Dependencies analysis, see Check How Assumed Dependencies Affect Modeling.
To run the Dependencies analysis, use the following command line action: --collect=dependencies.
Recommended action options:

--loop-call-count-limit=<num>
    Set the maximum number of call instances to analyze, assuming similar runtime properties over different call instances. The recommended value is 16.
--select=<string>
    Select loops to run the analysis for. For the Offload Modeling, the recommended value is --select markup=gpu_generic, which selects only loops/functions profitable for offloading to a target device to reduce the analysis overhead. For more information about markup options, see Loop Markup to Minimize Analysis Overhead. The generic markup strategy is recommended if you want to run the Dependencies analysis for an application that does not use DPC++, C++/Fortran with OpenMP target, or OpenCL.
--filter-reductions
    Mark all potential reductions with a specific diagnostic.
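For example, a sketch of a Dependencies run that combines the recommended options above:
  advisor --collect=dependencies --select markup=gpu_generic --loop-call-count-limit=16 --filter-reductions --project-dir=./advi_results -- ./myApplication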
Performance Modeling Options
To run the Performance Modeling analysis, use the following command line action: --collect=projection.
Recommended action options:

--config=<config>
    Specify a target GPU to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values and mapping to device names. Make sure to specify the same target device as for --collect=tripcounts --target-device=<target>.
--no-assume-dependencies
    Assume that a loop does not have dependencies if its dependency type is unknown. Use this option if your application contains parallel and/or vectorized loops and you did not run the Dependencies analysis.
--data-reuse-analysis
    Analyze potential data reuse between code regions when offloaded to a target GPU. Make sure to use --data-transfer=full with --collect=tripcounts for this option to work correctly.
--assume-hide-taxes
    Assume that an invocation tax is paid only for the first time a kernel is launched.
--set-parameter
    Specify a single-line configuration parameter to modify, in the format "<group>.<parameter>=<new-value>". For example, "min_required_speed_up=0". For details about the option, see set-parameter. For details about some of the possible modifications, see Advanced Modeling Strategies.
--profile-gpu
    Analyze a DPC++, OpenCL, or OpenMP target application on a GPU device. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
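For example, a sketch of a Performance Modeling run that assumes no dependencies for loops with an unknown dependency type and removes the minimum speedup threshold through --set-parameter:
  advisor --collect=projection --config=gen12_dg1 --no-assume-dependencies --set-parameter "min_required_speed_up=0" --project-dir=./advi_results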
See advisor Command Option Reference for more options.

Method 3. Use Python* Scripts

Intel Advisor has three scripts that use the Intel Advisor Python* API to run the Offload Modeling. You can run the scripts with the advisor-python command line tool or with your local Python 3.6 or 3.7.
The scripts vary in functionality and run different sets of Intel Advisor analyses. Depending on what you want to run, use one or several of the following scripts:
  • run_oa.py is the simplest script with limited modification flexibility. Use this script to run the collection and modeling steps with a single command. This script is the equivalent of the Intel Advisor command line collection preset.
  • collect.py is a moderately flexible script that runs only the collection step.
  • analyze.py is a moderately flexible script that runs only the performance modeling step.
The scripts do not support the analysis of MPI applications. For an MPI application, use the per-analysis collection with the Intel Advisor CLI.
You can run the Offload Modeling using different combinations of the scripts and/or the Intel Advisor CLI. For example:
  • Run run_oa.py to profile an application and model its performance.
  • Run collect.py to profile an application and analyze.py to model its performance. Re-run analyze.py to remodel with a different configuration.
  • Run the Intel Advisor CLI to collect performance data and analyze.py to model performance. Re-run analyze.py to remodel with a different configuration.
  • Run run_oa.py to collect data and model performance for the first time, and run analyze.py to remodel with a different configuration.
Consider the following examples of some typical scenarios with Python scripts.
Info: In the commands below, make sure to replace myApplication with your application executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
Example 1. Run the run_oa.py script to profile an application and model its performance for Intel® Iris® Xe MAX graphics (gen12_dg1 configuration).
  • On Linux OS:
    advisor-python $APM/run_oa.py ./advi_results --collect=basic --config=gen12_dg1 -- ./myApplication
  • On Windows OS:
    advisor-python %APM%\run_oa.py .\advi_results --collect=basic --config=gen12_dg1 -- myApplication.exe
You will see the result summary printed to the command prompt.
For more useful options, see the Analysis Details section below.
Example 2. Run collect.py to profile an application and analyze.py to model its performance.
  • On Linux OS:
    1. Collect performance data.
       advisor-python $APM/collect.py ./advi_results --collect=basic --config=gen12_dg1 -- ./myApplication
    2. Model application performance on Intel® Iris® Xe MAX graphics (gen12_dg1 configuration).
       advisor-python $APM/analyze.py ./advi_results --config=gen12_dg1
    You will see the result summary printed to the command prompt.
  • On Windows OS:
    1. Collect performance data.
       advisor-python %APM%\collect.py .\advi_results --collect=basic --config=gen12_dg1 -- myApplication.exe
    2. Model application performance on Intel® Iris® Xe MAX graphics (gen12_dg1 configuration).
       advisor-python %APM%\analyze.py .\advi_results --config=gen12_dg1
For more useful options, see the Analysis Details section below.
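Example 3. As a sketch of the mixed workflow mentioned in the combinations above, collect performance data with the Intel Advisor CLI and model performance with analyze.py (Linux commands shown):
  1. Collect performance data with the Intel Advisor CLI.
     advisor --collect=survey --static-instruction-mix --project-dir=./advi_results -- ./myApplication
     advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=gen12_dg1 --data-transfer=light --project-dir=./advi_results -- ./myApplication
  2. Model application performance, re-running the script with a different --config value to remodel for another target.
     advisor-python $APM/analyze.py ./advi_results --config=gen12_dg1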
Analysis Details
Each script has a set of additional options that modify its behavior and collect additional performance data. The more analyses you run and options you use, the higher the modeling accuracy.
Collection Options
The following options are applicable to the run_oa.py and collect.py scripts.
--collect=<mode>
    Specify the data to collect for your application:
      • Use basic to run only the Survey, Trip Counts, and FLOP analyses, analyze data transfer between host and device memory, attribute memory objects to loops, and track accesses to stack memory. This value corresponds to the Medium accuracy.
      • Use refinement to run only the Dependencies analysis. Do not analyze data transfers.
      • Use full (default) to run the Survey, Trip Counts, FLOP, and Dependencies analyses, analyze data transfer between host and device memory and potential data reuse, attribute memory objects to loops, and track accesses to stack memory. This value corresponds to the High accuracy.
    See Check How Assumed Dependencies Affect Modeling to learn when you need to collect dependency data.
--config=<config>
    Specify a target GPU to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values and mapping to device names. For collect.py, make sure to specify the same value of the --config option as for analyze.py.
--markup=<markup-mode>
    Select loops to collect Trip Counts and FLOP and/or Dependencies data for with a pre-defined markup algorithm. This option decreases collection overhead. By default, it is set to generic to analyze all loops profitable for offloading.
--gpu
    Analyze a DPC++, OpenCL, or OpenMP target application on a GPU device. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
For a full list of available options, see the run_oa.py and collect.py references.
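For example, a sketch of a two-step collection that first gathers the basic performance data and then adds dependency data with the refinement mode:
  advisor-python $APM/collect.py ./advi_results --collect=basic --config=gen12_dg1 -- ./myApplication
  advisor-python $APM/collect.py ./advi_results --collect=refinement --markup=generic -- ./myApplication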
Performance Modeling Options
The following options are applicable to the run_oa.py and analyze.py scripts.
--config=<config>
    Specify a target GPU to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values and mapping to device names. For analyze.py, make sure to specify the same value of the --config option as for collect.py.
--assume-parallel
    Assume that a loop does not have dependencies if there is no information about the loop dependency type and you did not run the Dependencies analysis.
--data-reuse-analysis
    Analyze potential data reuse between code regions when offloaded to a target GPU. Make sure to use --collect=full when running the analyses with collect.py, or use --data-transfer=full when running the Trip Counts analysis with the Intel Advisor CLI.
--gpu
    Analyze a DPC++, OpenCL, or OpenMP target application on a GPU device. For details about this workflow, see Run GPU-to-GPU Performance Modeling from Command Line.
For a full list of available options, see the run_oa.py and analyze.py references.
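For example, a sketch of a modeling run that assumes parallel execution for loops with unknown dependency types and enables the data reuse analysis (the latter requires --collect=full data, as noted above):
  advisor-python $APM/analyze.py ./advi_results --config=gen12_dg1 --assume-parallel --data-reuse-analysis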

View the Results

Intel Advisor provides several ways to work with the Offload Modeling results generated from the command line.
View Results in CLI
After you run the Performance Modeling with advisor --collect=projection or analyze.py, the result summary is printed to a terminal or a command prompt. In this summary report, you can view:
  • A description of a baseline device where the application performance was measured and a target device for which the application performance was modeled
  • The executable binary name
  • Top metrics for the measured and estimated (accelerated) application performance
  • Top regions recommended for offloading to the target and performance metrics per region
For example:
Info: Selected accelerator to analyze: Intel(R) Gen11 Integrated Graphics Accelerator 64EU.
Info: Baseline Host: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, GPU: Intel (R) .
Info: Binary Name: 'CFD'.
Info: An unknown atomic access pattern is specified: partial_sums_16. Possible values are same, sequential. sequential will be used.

Measured CPU Time: 44.858s    Accelerated CPU+GPU Time: 16.265s
Speedup for Accelerated Code: 3.5x    Number of Offloads: 7    Fraction of Accelerated Code: 60%

Top Offloaded Regions
--------------------------------------------------------------------------------------------------------------------------
 Location                                                | CPU     | GPU    | Estimated Speedup | Bounded By | Data Transferred
--------------------------------------------------------------------------------------------------------------------------
 [loop in compute_flux_ser at euler3d_cpu_ser.cpp:226]   | 36.576s | 9.340s | 3.92x             | L3_BW      | 12.091MB
 [loop in compute_step_factor_ser at euler3d_cpu_ser.... | 0.844s  | 0.101s | 8.37x             | LLC_BW     | 4.682MB
 [loop in time_step_ser at euler3d_cpu_ser.cpp:361]      | 0.516s  | 0.278s | 1.86x             | L3_BW      | 10.506MB
 [loop in time_step_ser at euler3d_cpu_ser.cpp:361]      | 0.456s  | 0.278s | 1.64x             | L3_BW      | 10.506MB
 [loop in time_step_ser at euler3d_cpu_ser.cpp:361]      | 0.432s  | 0.278s | 1.55x             | L3_BW      | 10.506MB
--------------------------------------------------------------------------------------------------------------------------
See Accelerator Metrics reference for more information about the metrics reported.
View Results in GUI
When you run the Intel Advisor CLI or Python scripts, an .advixeproj project is created automatically in the directory specified with --project-dir. This project is interactive and stores all the collected results and analysis configurations. You can view it in the Intel Advisor GUI.
To open the project in GUI, you can run the following command from a command prompt:
advisor-gui <project-dir>
If the report does not open, click Show Result on the Welcome pane.
You first see a Summary report that includes the most important information about the measured performance on a baseline device and the modeled performance on a target device, including:
  • Main metrics for the modeled performance of your program, which indicate if you should offload your application to a target device.
  • Specific factors that prevent your code from achieving better performance if executed on a target device, reported in Offload Bounded by.
  • Top five offloaded loops/functions that provide the highest benefit and top five non-offloaded loops/functions with the reason why they were not offloaded.
Offload Modeling Summary in GUI
View an Interactive HTML Report
When you execute the Offload Modeling from CLI, Intel Advisor automatically saves two types of HTML reports in the <project-dir>/e<NNN>/report directory:
  • An interactive HTML report, advisor-report.html, that represents results in a similar way to the GUI and enables you to view key estimated metrics for your application. Collect GPU Roofline data to view results for the Offload Modeling and GPU Roofline Insights perspectives in a single interactive HTML report.
  • A legacy HTML report, report.html, that enables you to get detailed information about functions in a call tree, download a configuration file for a target accelerator, and view perspective execution logs.
For details about HTML reports, see Work with Standalone HTML Reports.
An additional set of reports is generated in the <project-dir>/e<NNN>/pp<NNN>/data0 directory, including:
  • Multiple CSV reports for different metric groups, such as report.csv, whole_app_metrics.csv, bounded_by_times.csv, and latencies.csv.
  • A graphical representation of the call tree showing the offloadable and accelerated regions, named program_tree.dot.
  • A graphical representation of the call tree named program_tree.pdf, which is generated if a DOT* utility is installed on your system.
  • LOG files, which can be used for debugging and reporting bugs and issues.
These reports are lightweight and can be easily shared, as they do not require the Intel Advisor GUI.
Save a Read-only Snapshot
A snapshot is a read-only copy of a project result, which you can view at any time using the Intel Advisor GUI. To save an active project result as a read-only snapshot:
advisor --snapshot --project-dir=<project-dir> [--cache-sources] [--cache-binaries] -- <snapshot-path>
where:
  • --cache-sources is an option to add application source code to the snapshot.
  • --cache-binaries is an option to add application binaries to the snapshot.
  • <snapshot-path> is a path and a name for the snapshot. For example, if you specify /tmp/new_snapshot, the snapshot is saved in the tmp directory as new_snapshot.advixeexpz. You can skip this to save the snapshot to the current directory as snapshotXXX.advixeexpz.
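For example, a sketch that packs the active result together with sources and binaries into /tmp/new_snapshot.advixeexpz:
  advisor --snapshot --project-dir=./advi_results --cache-sources --cache-binaries -- /tmp/new_snapshot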
To open the result snapshot in the Intel Advisor GUI, you can run the following command:
advisor-gui <snapshot-path>
You can visually compare the saved snapshot against the current active result or other snapshot results.

Next Steps

See Identify Code Regions to Offload to understand the results. This section is GUI-focused, but you can still use it for interpretation.
For details about metrics reported, see Accelerator Metrics.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.