User Guide

Contents

Accelerator Metrics

This reference section describes the contents of data columns in reports of the
Offload Modeling
and
GPU Roofline Insights
perspectives.

2 FPUs Active

Description:
Average percentage of time when both FPUs are used.
Collected
during the
Survey
analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Instructions
column group in the GPU pane of the GPU Roofline Regions tab.

Active

Description
: Percentage of cycles actively executing instructions on all execution units (EUs).
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Array
column group in the GPU pane of the GPU Roofline Regions tab.

Advanced Diagnostics

Description:
Additional information about a code region that might help to understand the achieved performance.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the Code Regions pane of the Accelerated Regions tab.

Atomic Throughput

Description
: Total execution time by atomic throughput, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Average Time (GPU Roofline)

Description:
Average amount of time spent executing one task instance.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Compute Task Details
column.

Average Time (Offload Modeling)

Description:
Average amount of time spent executing one task instance. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis with enabled GPU profiling in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column.

Average Trip Count

Description:
Average number of times a loop/function was executed.
Collected
during the Trip Counts (Characterization) in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Bandwidth, GB/sec (GPU Memory)

Description:
Rate at which data is transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions tab.

Bandwidth, GB/sec (L3 Shader)

Description:
Rate at which data is transferred between execution units and L3 caches, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
L3 Shader
column group in the GPU pane of the GPU Roofline Regions report.

Bandwidth, GB/sec (Shared Local Memory)

Description:
Rate at which data is transferred to and from shared local memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions report.

Baseline Device

Description:
A host platform that the application is executed on.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Measured
column group.

Bounded by

Description:
List of main factors that limit the estimated performance of a code region offloaded to a target device.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: This metric shows one or more bottlenecks in a code region. Possible bottlenecks by category:
  • Algorithmic: Dependencies - Data dependencies limit the parallel execution efficiency. Fix the dependencies to offload this code region.
  • Algorithmic: Kernel Decomposition - The workload decomposition strategy does not allow scheduling enough parallel threads to use all execution units on a selected target device.
  • Algorithmic: Trip Counts - The number of loop iterations is not enough to use all execution units on a selected target device.
  • Taxes: Data Transfer - Data transfer tax is greater than the sum of the maximum throughput time and latencies time.
  • Taxes: Launch Tax - Kernel launch tax is greater than the sum of the maximum throughput time and latencies time.
  • Throughput: Compute - The code region uses full target device capabilities, but the compute time is still high. The time is greater than all other execution time components on a target device.
  • Throughput: Global Atomics - Global atomics bandwidth time is greater than all other execution time components on a target device.
  • Throughput: Memory Sub-System Bandwidth (BW; for example, L3 BW, LLC BW, DRAM BW) - Memory sub-system bandwidth time is greater than all other execution time components on a target device.
  • Latencies: Latencies - Instruction latency is greater than the maximum throughput time.
The resulting estimated time is calculated as a sum of four components: the maximum throughput bottleneck time, the non-overlapped latency, the data transfer tax, and the kernel submission tax:
Time = max_throughput_bottleneck_time + non_overlapped_latency + data_transfer_time + kernel_submission_taxes_time
The model assumes that throughput-defined times are fully "overlapped" and shows only the "maximum" throughput bottleneck in the column. If the impact of other components is comparable to the throughput component, the top bottlenecks of all four factors (one for throughput, one for latency, and one for data transfer/submission) are shown in this column. This means the code region is limited by this combination of factors, ordered by their impact on the region performance. Otherwise, for example, if the relative throughput impact is much higher than the latency and data transfer ones, only the maximum throughput bottleneck is shown as dominating over the others. If the maximum throughput time is compute, Intel Advisor assumes that algorithmic factors (dependencies, kernel decomposition, trip counts) limit offloading the code region.
For example, the combined
Data Transfer, DRAM BW
value means the following:
  • The main limiting factor for the code region is data transfer tax. The tax is greater than the sum of the maximum throughput time and latencies time for this region.
  • The second limiting factor for the code region is the DRAM bandwidth time. The time is greater than the other execution time components on a target device.
Example of Bounded By metric combination in the Offload Modeling report: Data Transfer, DRAM BW
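Concretely, the estimate above can be sketched as a max-plus-sum model. This is an illustrative sketch of the documented formula, not Intel Advisor code; all names and values are hypothetical:

```python
def estimate_offload_time(throughput_times_ms, latency_ms, data_transfer_ms, launch_tax_ms):
    """Sketch of the Offload Modeling estimate: throughput components are
    assumed to fully overlap, so only the maximum one contributes, while
    non-overlapped latency and the taxes are added on top of it."""
    bottleneck, max_throughput = max(throughput_times_ms.items(), key=lambda kv: kv[1])
    total_ms = max_throughput + latency_ms + data_transfer_ms + launch_tax_ms
    return bottleneck, total_ms

# DRAM BW dominates the throughput components, so it is reported as the bottleneck
bottleneck, total_ms = estimate_offload_time(
    {"Compute": 3.1, "DRAM BW": 4.2, "L3 BW": 1.0},
    latency_ms=0.5, data_transfer_ms=1.3, launch_tax_ms=0.2)
# bottleneck == "DRAM BW", total_ms is about 6.2 ms
```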

Call Count

Description:
Number of times a loop/function was invoked.
Collected
during the Trip Counts (Characterization) in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Compute

Description:
Estimated execution time assuming an offloaded loop is bound only by compute throughput.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Compute Task

Description:
Name of a compute task.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

Compute Task Details

Description:
Average amount of time spent executing one task instance. When collapsed, corresponds to the
Average
column. Expand to see more metrics.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

Compute Task Purpose

Description:
Action that a compute task performs.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

Computing Threads Started

Description:
Total number of threads started across all execution units for a computing task.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

D

Data Transfer Tax

Description:
Estimated time cost, in milliseconds, for transferring loop data between host and target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Data Transfer Tax without Reuse

Description:
Estimated time cost, in milliseconds, for transferring loop data between host and target platform considering no data is reused. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
  • CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Data Reuse Gain

Description:
Difference, in milliseconds, between data transfer time estimated with data reuse and without data reuse. This option is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
  • CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Dependency Type

Description:
Dependency absence or presence in a loop across iterations.
Collected
during the Survey and Dependencies analyses in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Measured
column group.
Possible values:
  • Parallel: Explicit
    - The loop does not have dependencies because it is explicitly vectorized or threaded on CPU.
  • Parallel: Proven
    - A compiler did not detect dependencies in the loop at the compile time but did not vectorize the loop automatically for a certain reason.
  • Parallel: Programming Model
    - The loop does not have dependencies because it is parallelized for execution on a target platform using a programming model (for example, OpenMP*, oneAPI Threading Building Blocks, Intel® oneAPI Data Analytics Library, Data Parallel C++).
  • Parallel: Workload
    - Intel Advisor did not find dependencies in the loop based on the workload analyzed during the Dependencies analysis.
  • Parallel: User
    - The loop is marked as not having dependencies with the
    --set-parallel=<string>
    option.
  • Parallel: Assumed
    - Intel Advisor does not have information about loop dependencies but assumes all such loops are parallel (that is, they do not have dependencies).
  • Dependency:
    <dependency-type>
    - Intel Advisor found dependencies of specific types in the loop during the Dependencies analysis. Possible dependency types are RAW (read after write), WAR (write after read), WAW (write after write), and Reduction.
  • Dependency: User
    - The loop is marked as having dependencies with the
    --set-dependency=<string>
    option.
  • Dependency: Assumed
    - Intel Advisor does not have information about dependencies for this loop but assumes all such loops have dependencies.
Prerequisites for collection/display
:
Some values in this column can appear only if you select specific options when collecting data or run the Dependencies analysis:
For Parallel: Workload and Dependency: <dependency-type>: Run the Dependencies analysis.
For
Parallel: User
:
  • GUI: Go to
    Project Properties
    Performance Modeling
    . In the
    Other parameters
    field, enter a
    --set-parallel=<string>
    and a comma-separated list of loop IDs and/or source locations to mark them as parallel.
  • CLI: Specify a comma-separated list of loop IDs and/or source locations with the
    --set-parallel=<string>
    option when modeling performance with
    advisor --collect=projection
    .
For
Dependency: User
:
  • GUI: Go to
    Project Properties
    Performance Modeling
    . In the
    Other parameters
    field, enter a
    --set-dependency=<string>
    and a comma-separated list of loop IDs and/or source locations to mark them as having dependencies.
  • CLI: Specify a comma-separated list of loop IDs and/or source locations with the
    --set-dependency=<string>
    option when modeling performance with
    advisor --collect=projection
    .
For
Parallel: Assumed
:
  • GUI: Disable
    Assume Dependencies
    under Performance Modeling analysis in the
    Analysis Workflow
    pane.
  • CLI: Use the
    --no-assume-dependencies
    option when modeling performance with
    advisor --collect=projection
    .
For
Dependency: Assumed
:
  • GUI: Enable
    Assume Dependencies
    under Performance Modeling analysis in the
    Analysis Workflow
    pane.
  • CLI: Use the
    --assume-dependencies
    option when modeling performance with
    advisor --collect=projection
    .
Interpretation
:
  • Loops with
    no
    real dependencies (
    Parallel: Explicit
    ,
    Parallel: Proven
    ,
    Parallel: Programming Model
    , and
    Parallel: User
    if you know that marked loops are parallel) can be safely offloaded to a target platform.
  • If many loops have
    Parallel: Assumed
    or
    Dependency: Assumed
    value, you are recommended to run the Dependencies analysis. See Check How Assumed Dependencies Affect Modeling for details.

Device-to-Host Size

Description:
Total data transferred from device to host.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Data Transferred
column group.

Device-to-Host Time

Description:
Total time spent on transferring data from device to host.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Data Transferred
column group.

DRAM

Description
: Summary of estimated DRAM memory usage, including DRAM bandwidth (in gigabytes per second) and total DRAM traffic, which is a sum of read and write traffic.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:

DRAM BW (Estimated Bounded By)

Description:
DRAM Bandwidth. Estimated time, in seconds, spent on reading from DRAM memory and writing to DRAM memory assuming a maximum DRAM memory bandwidth is achieved.
Collected
during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

DRAM BW (Memory Estimates)

Description
: DRAM Bandwidth. Estimated rate at which data is transferred to and from the DRAM, in gigabytes per second.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

DRAM BW Utilization

Description
: Estimated DRAM bandwidth utilization, in percent, calculated as the ratio of the average bandwidth to the maximum theoretical bandwidth.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.
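The utilization reported here is a plain ratio. A minimal sketch under assumed numbers (the helper function and the bandwidth values are illustrative, not Intel Advisor output):

```python
def bw_utilization_percent(average_bw_gbps, peak_bw_gbps):
    """Bandwidth utilization: average observed bandwidth as a percentage
    of the maximum theoretical bandwidth of the memory level."""
    return 100.0 * average_bw_gbps / peak_bw_gbps

# e.g. an estimated 38.4 GB/s average against a 76.8 GB/s theoretical peak
utilization = bw_utilization_percent(38.4, 76.8)
# utilization == 50.0
```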

DRAM Read Traffic

Description
: Total estimated data read from the DRAM memory.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

DRAM Traffic

Description
: Estimated sum of data read from and written to the DRAM memory.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

DRAM Write Traffic

Description
: Total estimated data written to the DRAM memory.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Elapsed Time

Description:
Wall-clock time from beginning to end of computing task execution.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

EU Threading Occupancy

Description:
Percentage of cycles on all execution units (EUs) and thread slots when a slot has a thread scheduled.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

Estimated Data Transfer with Reuse

Description:
Summary of data read from a target platform and written to the target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.

FP AI

Description:
Ratio of FLOP to the number of transferred bytes.
Collected
during the FLOP analysis (Characterization) enabled in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions tab.

Fraction of Offloads

Description:
Percentage of time spent in code regions profitable for offloading in relation to the total execution time of the region.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Basic Estimated Metrics
column group.
Interpretation
: 100% means there are no non-offloaded child regions, calls to parallel runtime libraries, or system calls in the region.

From Target

Description:
Estimated data transferred from a target platform to a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

GFLOP

Description:
Number of giga floating-point operations.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection
:
  • BASIC COMPUTE, FMA, BIT, DIV, POW, MATH

GFLOPS

Description:
Number of giga floating-point operations per second.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection
:
  • BASIC COMPUTE, FMA, BIT, DIV, POW, MATH
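GFLOPS is GFLOP divided by elapsed time, and together with the FP AI metric it locates a kernel on the Roofline chart. A hedged sketch with made-up numbers (the function is illustrative, not an Intel Advisor API):

```python
def roofline_point(gflop, elapsed_s, bytes_transferred):
    """Roofline coordinates for a kernel: x is arithmetic intensity
    (FLOP per transferred byte), y is achieved GFLOPS."""
    ai = gflop * 1e9 / bytes_transferred   # FLOP/byte
    gflops = gflop / elapsed_s             # GFLOP per second
    return ai, gflops

# 10 GFLOP executed in 0.5 s while transferring 5 GB of data
ai, gflops = roofline_point(gflop=10.0, elapsed_s=0.5, bytes_transferred=5e9)
# ai == 2.0 FLOP/byte, gflops == 20.0
```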

GINTOP

Description:
Number of giga integer operations.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection
:
  • BASIC COMPUTE, FMA, BIT, DIV, POW, MATH

GINTOPS

Description:
Number of giga integer operations per second.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection
:
  • BASIC COMPUTE, FMA, BIT, DIV, POW, MATH

Global

Description:
Total number of work items in all work groups.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Work Size
column group in the GPU pane of the GPU Roofline Regions tab.

Global Size (Offload Modeling - Compute Estimates)

Description:
Total estimated number of work items in a loop after it is offloaded to a target platform.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Compute Estimates
column group.

Global Size (Offload Modeling - Measured)

Description:
Total number of work items in a kernel instance on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis with enabled GPU profiling in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Measured
column group.

GPU Shader Atomics

Description:
Total number of shader atomic memory accesses.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

GPU Shader Barriers

Description:
Total number of shader barrier messages.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the GPU pane of the GPU Roofline Regions tab.

Host-to-Device Size

Description:
Total data transferred from host to device.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Data Transferred
column group.

Host-to-Device Time

Description:
Total time spent on transferring data from host to device.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Data Transferred
column group.

Idle

Description:
Percentage of cycles on all execution units (EUs) during which no threads are scheduled on an EU.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Array
column group in the GPU pane of the GPU Roofline Regions report.

Ignored Time

Description:
Time spent in system calls and calls to ignored modules or parallel runtime libraries in the code regions recommended for offloading.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Non-User Code Metrics
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
From CLI, run the --collect=projection action with the --ignore=<code-to-ignore> action option. For example, to ignore MPI and OpenMP calls, use --ignore=mpi,omp.
Prerequisite for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: Time in the ignored code parts is not used for the Offload Modeling estimations. It does not affect time estimated for offloaded code regions.

Instances (GPU Roofline)

Description:
Total number of times a task executes on a GPU.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisite for display
: Expand the
Compute Task Details
column group.

Instances (Offload Modeling - Compute Estimates)

Description:
Total estimated number of times a loop executes on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Compute Estimates
column group.

Instances (Offload Modeling - Measured)

Description:
Total number of times a loop executes on a baseline GPU device.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.

INT AI

Description:
Ratio of INTOP to the number of transferred bytes.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
GPU Compute Performance
column group in the GPU pane of the GPU Roofline Regions report.
Instruction types counted during Characterization collection
:
  • BASIC COMPUTE, FMA, BIT, DIV, POW, MATH

IPC Rate

Description:
Average rate of instructions per cycle (IPC) calculated for two FPU pipelines.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Instructions
column group in the GPU pane of the GPU Roofline Regions report.

Iteration Space

Description:
Summary of iteration metrics measured on a baseline device.
Collected
during the Trip Counts (Characterization) analysis (for CPU regions) or the Survey analysis with enabled GPU profiling (for GPU regions) in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: For the CPU-to-GPU modeling, this column reports the following metrics:
  • Call Count - The number of times a loop/function was invoked.
  • Trip Counts (average) - The average number of times a loop/function was executed.
  • Vector ISA - The highest vector instruction set architecture (ISA) used for individual instructions.
For the GPU-to-GPU modeling, this column reports the following metrics:
  • Global - Total number of work items in all work groups.
  • Local - The number of work items in one work group.
  • SIMD - The number of work items processed by a single GPU thread.
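The relationship between these GPU-to-GPU metrics can be sketched as follows (a hypothetical helper, not an Intel Advisor API): global work items split into work groups of the local size, and each GPU thread processes SIMD work items.

```python
import math

def iteration_space(global_size, local_size, simd_width):
    """Derive work-group and GPU-thread counts from the Iteration Space
    metrics: Global (total work items), Local (per group), SIMD (per thread)."""
    work_groups = math.ceil(global_size / local_size)
    gpu_threads = math.ceil(global_size / simd_width)
    return work_groups, gpu_threads

# 1,048,576 work items in groups of 256, with 16 work items per GPU thread
work_groups, gpu_threads = iteration_space(1_048_576, 256, 16)
# work_groups == 4096, gpu_threads == 65536
```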


Kernel

Description
: GPU kernel name. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis in the
Offload Modeling
perspective.

Kernel Launch Tax

Description:
Total estimated time cost for invoking a kernel when offloading a loop to a target platform.
Does not include data transfer costs.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Latencies

Description:
Top uncovered latency in a loop/function, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.

L3 BW

Description:
L3 Bandwidth. Estimated time, in seconds, spent on reading from L3 cache and writing to L3 cache assuming a maximum L3 cache bandwidth is achieved.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

L3 Cache

Description
: Summary of estimated L3 cache usage, including L3 cache bandwidth (in gigabytes per second) and L3 cache traffic, which is a sum of read and write traffic.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:

L3 Cache BW

Description
: Average estimated rate at which data is transferred to and from the L3 cache, in gigabytes per second.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

L3 Cache BW Utilization

Description
: Estimated L3 cache bandwidth utilization, in percent, calculated as a ratio of the average bandwidth to the maximum theoretical bandwidth.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.
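As a plain-arithmetic illustration of the utilization metrics in this group (the function name and sample values below are illustrative, not part of the tool):

```python
def bw_utilization_percent(average_gb_per_s, peak_gb_per_s):
    """Bandwidth utilization as described above: the ratio of the average
    bandwidth to the maximum theoretical bandwidth, expressed in percent."""
    return average_gb_per_s / peak_gb_per_s * 100.0

# Example: an average of 120 GB/s against a 480 GB/s theoretical peak.
print(bw_utilization_percent(120.0, 480.0))  # 25.0
```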

L3 Cache Line Utilization

Description:
L3 cache line utilization for data transfer, in percentage.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
CARM (EU <-> Data Port)
column group in the GPU pane of the GPU Roofline Regions tab.

L3 Cache Read Traffic

Description
: Total estimated data read from the L3 cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

L3 Cache Traffic

Description
: Estimated sum of data read from and written to the L3 cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

L3 Cache Write Traffic

Description
: Total estimated data written to the L3 cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

LLC

Description
: Estimated last-level cache (LLC) usage, including LLC cache bandwidth (in gigabytes per second) and total LLC cache traffic, which is a sum of read and write traffic.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:

LLC BW (Offload Modeling - Estimated Bounded by)

Description:
Last-level cache (LLC) bandwidth. Estimated time, in seconds, spent on reading from LLC and writing to LLC assuming a maximum LLC bandwidth is achieved.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.
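The "estimated time assuming a maximum bandwidth is achieved" phrasing used by this and the similar BW entries reduces to dividing traffic by peak bandwidth; a minimal sketch with illustrative names and numbers:

```python
def time_by_bw_seconds(traffic_gb, peak_bw_gb_per_s):
    """Estimated transfer time assuming the maximum bandwidth is achieved:
    time = traffic / peak bandwidth."""
    return traffic_gb / peak_bw_gb_per_s

# Example: 3 GB of LLC traffic at a 300 GB/s theoretical peak.
print(time_by_bw_seconds(3.0, 300.0))  # 0.01
```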

LLC BW (Offload Modeling - Memory Estimations)

Description
: Estimated rate at which data is transferred to and from the LLC cache, in gigabytes per second.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

LLC BW Utilization

Description
: Estimated LLC cache bandwidth utilization, in percent, calculated as a ratio of the average bandwidth to the maximum theoretical bandwidth.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

LLC Read Traffic

Description
: Total estimated data read from the LLC cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

LLC Traffic

Description
: Estimated sum of data read from and written to the LLC cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

LLC Write Traffic

Description
: Total estimated data written to the LLC cache.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Load Latency

Description:
Uncovered cache or memory load latencies in a code region, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the
Code Regions
pane.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Local

Description:
Number of work items in one work group.
Collected
during the
Survey
analysis in the
GPU Roofline Insights
perspective and
found
in the
Work Size
column group in the GPU pane of the GPU Roofline Regions report.

Local Memory Size

Description:
Local memory size used by each thread group.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisite for display
: Expand the
Compute Task Details
column group.

Local Size (Offload Modeling - Compute Estimates)

Description:
Total estimated number of work items in one work group of a loop after it is offloaded to a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Compute Estimates
column group.

Local Size (Offload Modeling - Measured)

Description:
Total number of work items in one work group of a kernel. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis with enabled GPU profiling in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Measured
column group.

Loop/Function

Description
: Name and source location of a loop/function in a region, where region is a sub-tree of loops/functions in a call tree.
Collected
during the Survey analysis in the
Offload Modeling
perspective.

Module

Description:
Program module name.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the
Location
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Location
column group.

N

Offload Tax

Description:
Total estimated time spent transferring data and launching a kernel, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Offload Summary

Description
: Conclusion that indicates whether a code region is profitable for offloading to a target platform. In the Top-Down pane, it also reports the node position, such as offload child loops and child functions.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the Code Regions pane of the Accelerated Regions tab.

Overall Non-Accelerable Time

Description:
Total estimated time spent in non-offloaded parts of offloaded code regions.
Collected
during the Survey and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: These code parts are located inside offloaded regions, but the performance model assumes these parts are executed on a baseline device. Examples of such code parts are OpenMP* code parts, Data Parallel C++ (DPC++) runtimes, and system calls.

Parallel Factor

Description
: Number of loop iterations or kernel work items executed in parallel on a target device for a loop/function.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.

Parallel Threads

Description
: Estimated number of threads scheduled simultaneously on
all
execution units (EUs).
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Compute Estimates
column group.

Performance Issues (GPU Roofline)

Description
: Performance issues and recommendations for optimizing code regions executed on a GPU.
Collected
during the Survey, Characterization, and Performance Modeling analyses in the
GPU Roofline Insights
perspective and found in the GPU pane of the GPU Roofline Regions tab.
Interpretation
: Click to view the full recommendation text with code examples and recommended fixes in the Recommendations pane in the GPU Roofline Regions tab.

Performance Issues (Offload Modeling)

Description
: Recommendations for offloading code regions with estimated performance summary and/or potential issues with optimization hints.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: Click to view the full recommendation text with examples of using DPC++ and OpenMP* programming models to offload the code regions and/or fix the performance issues in the Recommendations pane in the Accelerated Regions tab.

Private

Description:
Total estimated data transferred to a private memory from a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfers with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfers with Reuse
column group.
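The CLI prerequisite above can be scripted. The helper below is a hypothetical wrapper (not part of Advisor) that only assembles the documented command line; `--project-dir` is the standard option for the results directory.

```python
def tripcounts_command(project_dir, level="medium"):
    """Build the documented Trip Counts command with a Data Transfer
    Simulation level of light, medium, or full."""
    if level not in ("light", "medium", "full"):
        raise ValueError("data-transfer level must be light, medium, or full")
    return [
        "advisor",
        "--collect=tripcounts",
        "--data-transfer=" + level,
        "--project-dir=" + project_dir,
    ]

# Pass the result to subprocess.run() on a machine with Advisor installed.
print(" ".join(tripcounts_command("./advi_results", "full")))
```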

Private Memory Size

Description:
Private memory size allocated by a compiler to each thread.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisite for display
: Expand the
Compute Task Details
column group.

Programming Model

Description:
Programming model used in a loop/function, if any.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Measured
column group in the
Code Regions
pane.
Prerequisite for display
: Expand the
Measured
column group.

Q

Read

Description:
Estimated data read from a target platform by an offload region, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfers with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfers with Reuse
column group.

Read, GB (GPU Memory)

Description:
Total amount of data read from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
GPU Memory
column group.

Read, GB (Shared Local Memory)

Description:
Total amount of data read from the shared local memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Shared Local Memory
column group.

Read, GB/s (GPU Memory)

Description:
Rate at which data is read from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display
: Expand the
GPU Memory
column group.

Read, GB/s (Shared Local Memory)

Description:
Rate at which data is read from shared local memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display
: Expand the
Shared Local Memory
column group.

Read without Reuse

Description:
Estimated data read from a target platform by a code region considering no data is reused between kernels, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfers with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=full
    option, and use the
    --data-reuse-analysis
    option with both the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Data Transfers with Reuse
column group.

Send Active

Description:
Percentage of cycles on all execution units when EU Send pipeline is actively processed.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Instructions
column group in the GPU pane of the GPU Roofline Regions report.

SIMD Width (GPU Roofline)

Description:
The number of work items processed by a single GPU thread.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display
: Expand the
Compute Task Details
column group.

SIMD Width (Offload Modeling - Compute Estimates)

Description:
Estimated number of work items processed by a single thread on a target platform.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Compute Estimates
column group.

SIMD Width (Offload Modeling - Measured)

Description:
Number of work items processed by a single thread on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis with enabled GPU profiling in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

SLM

Description
: Summary of estimated SLM usage, including SLM bandwidth (in gigabytes per second) and SLM traffic, which is a sum of read and write traffic.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.

SLM BW (Offload Modeling - Estimated Bounded by)

Description:
Shared Local Memory (SLM) Bandwidth. Estimated time, in seconds, spent on reading from SLM and writing to SLM assuming a maximum SLM bandwidth is achieved.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

SLM BW (Offload Modeling - Memory Estimations)

Description
: Average estimated rate at which data is transferred to and from the SLM. This is a dynamic value, and depending on the bandwidth value, it can be measured in bytes per second, kilobytes per second, megabytes per second, and so on.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

SLM BW Utilization

Description
: Estimated SLM bandwidth utilization, in percent, calculated as a ratio of the average bandwidth to the maximum theoretical bandwidth.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

SLM Read Traffic

Description
: Total estimated data read from the SLM.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

SLM Traffic

Description
: Estimated sum of data read from and written to the SLM.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

SLM Write Traffic

Description
: Total estimated data written to the SLM.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

Source Location

Description:
Source file name and line number.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the
Location
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: Use this column to understand where a code region is located.

Stalled

Description:
Percentage of cycles on all execution units (EUs) when at least one thread is scheduled, but the EU is stalled.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Array
column group in the GPU pane of the GPU Roofline Regions report.

SVM Usage Type

Description
: Type of shared virtual memory (SVM) usage for the compute task, if any.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display
: Expand the
Compute Task Details
column group.

Speed-up

Description:
Estimated speedup after a loop is offloaded to a target device, in comparison to the original elapsed time.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
:
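A sketch of the ratio, assuming speed-up compares the measured baseline time with the estimated offloaded time (consistent with the Time (Measured) and Time (Estimated) entries below); the function name is illustrative:

```python
def estimated_speedup(measured_seconds, estimated_seconds):
    """Speed-up relative to the original elapsed time; values above 1.0
    mean the offloaded version is modeled to run faster."""
    return measured_seconds / estimated_seconds

# Example: a 2 s loop modeled to take 0.5 s after offloading.
print(estimated_speedup(2.0, 0.5))  # 4.0
```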

Taxes with Reuse

Description:
The highest estimated time cost and a sum of all other costs for offloading a loop from host to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform. A
triangle
icon in a table cell indicates that this region reused data.
This decreases the estimated data transfer tax.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the Code Regions pane of the Accelerated Regions tab.

Thread Occupancy (Offload Modeling - Compute Estimates)

Description
: Average percentage of thread slots occupied on all execution units estimated on a target device.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Compute Estimates
column group.

Thread Occupancy (Offload Modeling - Measured)

Description
: Average percentage of thread slots occupied on all execution units measured on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected
during the Survey analysis with enabled GPU profiling in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Threads per EU

Description
: Estimated number of threads scheduled simultaneously
per execution unit (EU)
.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Compute Estimates
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Compute Estimates
column group.

Throughput

Description:
Top two factors that a loop/function is bounded by, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the
Code Regions
pane.

Time (Estimated)

Description:
Estimated elapsed wall-clock time from beginning to end of loop execution estimated on a target platform after offloading.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the Code Regions pane of the Accelerated Regions tab.

Time (Measured)

Description:
Elapsed wall-clock time from beginning to end of loop execution measured on a host platform.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.

Time by DRAM BW

Description
: Estimated time, in seconds, spent on reading from DRAM memory and writing to DRAM memory assuming a maximum DRAM memory bandwidth is achieved.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Time by L3 Cache BW

Description
: Estimated time, in seconds, spent on reading from L3 cache and writing to L3 cache assuming a maximum L3 cache bandwidth is achieved.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Time by LLC BW

Description
: Estimated time, in seconds, spent on reading from LLC and writing to LLC assuming a maximum LLC bandwidth is achieved.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Time by SLM BW

Description
: Estimated time, in seconds, spent on reading from SLM and writing to SLM assuming a maximum SLM bandwidth is achieved.
Collected
during the Trip Counts (Characterization) and Performance Modeling analyses in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection
:
  • For CPU-to-GPU modeling, run the
    --collect=projection
    action with the
    --enable-slm
    option.
  • For GPU-to-GPU modeling, the metric is available by default.
Prerequisites for display
: Expand the
Memory Estimations
column group.

To Target

Description:
Estimated data transferred to a target platform from a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

ToFrom Target

Description:
Sum of estimated data transferred both to/from a shared memory to/from a target platform by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

Total

Description:
Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform, for an offload loop, in megabytes. It is calculated as
(MappedTo + MappedFrom + 2*MappedToFrom)
. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.
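The formula in the description is direct arithmetic; a sketch with illustrative values:

```python
def total_traffic_mb(mapped_to, mapped_from, mapped_to_from):
    """Total = MappedTo + MappedFrom + 2*MappedToFrom, in megabytes.
    To/from data crosses the interconnect in both directions, hence the 2x."""
    return mapped_to + mapped_from + 2 * mapped_to_from

# Example: 10 MB to, 5 MB from, 3 MB to/from the target platform.
print(total_traffic_mb(10.0, 5.0, 3.0))  # 21.0
```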

Total, GB (CARM)

Description:
Total data transferred to and from execution units, in gigabytes.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
CARM (EU <-> Data Port)
column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB (GPU Memory)

Description:
Total amount of data transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB (L3 Shader)

Description:
Total amount of data transferred between execution units and L3 caches, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
L3 Shader
column group in the GPU pane of the GPU Roofline Regions report.

Total, GB (Shared Local Memory)

Description:
Total amount of data transferred to and from the shared local memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB/s

Description:
Average data transfer bandwidth between CPU and GPU.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Data Transferred
column group.
Interpretation
: In some cases, for example,
clEnqueueMapBuffer
, data transfers might generate high bandwidth because memory is not copied but shared using L3 cache.

Total Size

Description:
Total data processed on a GPU.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.

Total Time

Description:
Total amount of time spent executing a task.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Compute Task Details
column group.

Total Time in DAAL Calls

Description:
Total time spent in Intel® Data Analytics Acceleration Library (Intel® DAAL) calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains Intel DAAL calls.

Total Time in DPC++ Calls

Description:
Total time spent in Data Parallel C++ (DPC++) calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains DPC++ calls.

Total Time in MPI Calls

Description:
Total time spent in MPI calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: If the value in the column is >0s, the code region contains MPI calls.

Total Time in OpenCL Calls

Description:
Total time spent in OpenCL™ calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains OpenCL calls.

Total Time in OpenMP Calls

Description:
Total time spent in OpenMP* calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains OpenMP calls.

Total Time in System Calls

Description:
Total time spent in system calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains system calls.

Total Time in TBB Calls

Description:
Total time spent in Intel® oneAPI Threading Building Blocks (oneTBB) calls in an offloaded code region.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Time in Non-User Code
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Time in Non-User Code
column group.
Interpretation
: If the value in the column is >0s, the code region contains oneTBB calls.

Total Trip Count

Description:
Total number of times a loop/function was executed.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Total without Reuse

Description:
Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform considering no data is reused, in megabytes. It is calculated as
(MappedTo + MappedFrom + 2*MappedToFrom)
. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Use the
    --data-transfer=full
    option with the
    --collect=tripcounts
    action and the
    --data-reuse-analysis
    option with the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.
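To make the distinction with the plain Total metric concrete, here is a hypothetical sketch (the buffer names and sizes are invented, not tool output): without reuse, every region transfers all of its buffers independently; with reuse, a buffer already resident on the target is counted only once:

```python
# Hypothetical per-region buffer maps (buffer name -> traffic in MB).
regions = [
    {"a": 40, "b": 24},   # region 1 maps buffers a and b to the device
    {"a": 40, "c": 8},    # region 2 reuses a and adds c
]

# Without reuse: each region pays for all of its buffers.
total_without_reuse = sum(sum(r.values()) for r in regions)

# With reuse: each distinct buffer is transferred once.
distinct = {}
for r in regions:
    distinct.update(r)
total_with_reuse = sum(distinct.values())

print(total_without_reuse, total_with_reuse)  # 112 72
```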

Unroll Factor

Description:
Loop unroll factor applied by the compiler.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Vector ISA

Description:
The highest vector Instruction Set Architecture (ISA) used for individual instructions.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Vector Length

Description:
The number of elements processed in a single iteration of vector loops or the number of elements processed in individual vector instructions determined by a binary static analysis or an Intel Compiler.
Collected
during the Survey in the
Offload Modeling
perspective and
found
in the
Measured
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for display
: Expand the
Measured
column group.

Why Not Offloaded

Description:
A reason why a code region is not recommended for offloading to a target GPU device.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the Code Regions pane of the Accelerated Regions tab.
Interpretation
: See Investigate Non-Offloaded Code Regions for a description of the available reasons.

Write

Description:
Estimated data written to a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Use the
    --data-transfer=[full | medium | light]
    option with the
    --collect=tripcounts
    action.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.
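For reference, a complete command sequence that satisfies this collection prerequisite might look as follows. This is a sketch: the project directory and application path are placeholders, and your installation may require sourcing the oneAPI environment first.

```shell
# Survey first, so trip counts can be attributed to loops.
advisor --collect=survey --project-dir=./advi_results -- ./myApplication

# Trip Counts with data transfer simulation (light, medium, or full).
advisor --collect=tripcounts --flop --data-transfer=light \
        --project-dir=./advi_results -- ./myApplication

# Model performance on the target to populate the estimated metrics.
advisor --collect=projection --project-dir=./advi_results
```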

Write, GB (GPU Memory)

Description:
Total amount of data written to GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
GPU Memory
column group.

Write, GB (Shared Local Memory)

Description:
Total amount of data written to the shared local memory, in gigabytes.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Shared Local Memory
column group.

Write, GB/s (GPU Memory)

Description:
Rate at which data is written to GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
GPU Memory
column group.

Write, GB/s (Shared Local Memory)

Description:
Rate at which data is written to shared local memory, in gigabytes per second.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory
column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display
: Expand the
Shared Local Memory
column group.

Write without Reuse

Description:
Estimated data written to a target platform by a code region considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Use the
    --data-transfer=full
    option with the
    --collect=tripcounts
    action and the
    --data-reuse-analysis
    option with the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

X, Y, Z
