User Guide

Contents

Investigate Non-Offloaded Code Regions

The modeling step analyzes the profitability of offloading code regions to a target device. Some regions might not be profitable for offloading, or cannot be modeled at all.
If you view a result in the
Intel® Advisor GUI
: To see why a code region of interest is reported as not recommended for offloading, select the loop in a
Code Regions
pane and open the
Details
tab in the right-side pane for detailed loop information, including the reason why the loop is not recommended for offloading.
By default, the report shows all code regions. You can apply filters to see only regions recommended or not recommended for offloading: open the drop-down list and select the desired filter option.
If you view the result in an
Offload Modeling HTML report
: Go to the
Non-Offloaded Regions
tab and examine the
Why Not Offloaded
column in the
Offload Information
group to see the reason why a code region is not recommended for offloading to a selected target platform.
For each region not recommended for offloading, you can force offload modeling. See Enforce Offloading for Specific Loops.

Cannot Be Modeled

Message
Cause and Details
Solution
Cannot be modeled: Outside of Marked Region
Intel® Advisor
cannot model performance for a code region because it is not marked up for analysis.
Make sure a code region satisfies all markup rules or use a different markup strategy:
  • It is not a system module or a system function.
  • It has instruction mixes.
  • It is executed.
  • Its execution time is not less than 0.02 seconds.
Cannot be modeled: Not Executed
A code region is in the call tree, but the
Intel Advisor
detected no calls to it for a dataset used during Survey.
This can happen if the execution time of the loop is very small and close to the sampling interval of the
Intel Advisor
. Such loops can have significant inaccuracies in time measurement. By default, the sampling interval for the Survey analysis is 0.01 seconds.
You can try to decrease the sampling interval of the
Intel Advisor
:
  • From GUI:
    1. Go to
      Project Properties
      Survey Hotspots Analysis
      Advanced
      .
    2. Set the
      Sampling Interval
      to less than 10 ms.
    3. Re-run
      Offload Modeling
      .
  • From CLI: Use the
    --interval
    option when running the
    --collect=survey
    action.
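For example, a re-run with a reduced sampling interval might look like the following sketch. The `advisor` command-line tool is assumed, and the project directory and application name are placeholders.

```shell
# Re-run the Survey analysis with a 5 ms sampling interval (default is 10 ms)
advisor --collect=survey --interval=5 --project-dir=./advi_results -- ./myApplication
# Then re-run the performance modeling step
advisor --collect=projection --project-dir=./advi_results
```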
Cannot be modeled: Internal Error
Internal Error
means incorrect data or lack of data because the
Intel Advisor
encountered issues when collecting or processing data.
Try to re-run the
Offload Modeling
perspective to fix the metrics attribution problem. If this does not help, use the Analyzers Community forum for technical support.
Cannot be modeled: System Module
This code region is a system function/loop.
This is not an issue. If this code region is inside an offload region or a runtime call, its execution time is added to the execution time of the offloaded region.
Cannot be modeled: No Execution Count
The
Intel Advisor
detected no calls to a loop during the Trip Count step of the Characterization analysis, so no information about execution counts is available for this loop.
Check which loop is executed at this branch.
If you see a wrong loop, try re-running the
Offload Modeling
perspective to fix the metrics attribution problem.

Less or Equally Profitable Than Children/Parent Offload

This message is not an issue. It means that
Intel Advisor
has found a more profitable code region to offload. If you still want to see offload estimations for the original code region, use the solutions described in the table below.
Message
Cause and Details
Solution
Less or equally profitable than children offloads
Offloading child loops/functions of this code region is more profitable than offloading the whole region with all its children. This means that the execution
Time
estimated on a target platform for the region of interest is greater than or equal to the sum of
Time
values estimated on a target platform for
its child regions
profitable for offloading.
The following reasons might prevent offloading: total execution time, taxes, trip counts, dependencies.
Model offloading for specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details.
Less or equally profitable than parent offload
Offloading a whole parent code region of the region of interest is more profitable than offloading any of its child regions separately. This means that the
Time
estimated on a target platform for the region of interest is greater than or equal to the
Time
estimated on the target platform for
its parent region
.
Offloading a child code region might be limited by high offload taxes.
Solution 1
If you assume the kernel execution should overlap offload taxes, use the
--assume-hide-taxes
option with
--collect=projection
action option or the
analyze.py
script. See Manage Invocation Taxes for details.
Solution 2
Model offloading for only specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details.

Not Profitable

Message
Cause and Details
Solution
Not profitable: Parallel execution efficiency is limited due to Dependencies
Dependencies limit parallel execution and the code region cannot benefit from offloading to a target device. The estimated execution time after acceleration is greater than or equal to the original execution time.
Solution 1
Ignore assumed dependencies and model offloading for all or selected code regions:
  • From GUI:
    1. Go to
      Project Properties
      Performance Modeling
      .
    2. Enter one of the options in the
      Other Parameters
      field:
      • --no-assume-dependencies
        to assume
        all
        code regions that do not have information about their dependency are parallel
      • --set-parallel=[<loop-IDs/source-locations>]
        to ignore dependencies for specified code regions
    3. Re-run Performance Modeling.
  • From CLI: When running
    --collect=projection
    or
    analyze.py
    , use one of the following:
    • --no-assume-dependencies
      to ignore dependencies for
      all
      code regions
    • --set-parallel=[<loop-IDs/source-locations>]
      to ignore dependencies for specified code regions
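Solution 1 as a command-line sketch (the `advisor` CLI and project directory are assumed; the loop IDs are placeholders):

```shell
# Treat all regions without dependency information as parallel
advisor --collect=projection --no-assume-dependencies --project-dir=./advi_results
# Or ignore assumed dependencies only for specific regions
advisor --collect=projection --set-parallel="[53,42]" --project-dir=./advi_results
```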
Solution 2
If you did not enable the Dependencies analysis when collecting data, run the analysis as follows to get detailed information about real dependencies in your code:
  • From GUI: Enable the Dependencies and Performance Modeling analyses from the
    Analysis Workflow
    pane and re-run the perspective.
  • From CLI: Run the Dependencies analysis with
    --collect=dependencies
    and re-run the Performance Modeling with
    --collect=projection
    or
    analyze.py
    .
See Dependency Type metric description for details.
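Solution 2 as a command-line sketch (paths and the application name are placeholders):

```shell
# Collect real dependency data for the marked-up loops
advisor --collect=dependencies --project-dir=./advi_results -- ./myApplication
# Re-run the Performance Modeling with the new dependency information
advisor --collect=projection --project-dir=./advi_results
```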
Not profitable: The Number of Loop Iterations is not enough to fully utilize Target Platform capabilities
The loop cannot benefit from offloading to a target platform as it has a low number of iterations.
In most cases, such code regions cannot benefit from offloading. If you assume that during code migration, the amount of parallel work grows and a loop is broken down into several chunks by a compiler or a program model, use the following workaround:
  • From GUI:
    1. Go to
      Project Properties
      Performance Modeling
      .
    2. Enter
      --batching
      or
      --threads=<target-threads>
      in the
      Other Parameters
      field.
      <target-threads>
      is the number of parallel threads equal to the target device capacity.
    3. Re-run Performance Modeling.
  • From CLI: When running
    --collect=projection
    or
    analyze.py
    , use one of the following:
    • --batching
      to model batching-like techniques
    • --threads=<target-threads>
      , where
      <target-threads>
      is the number of parallel threads equal to the target device capacity
If you enable batching, the kernel invocation tax might grow. You can use the
--assume-hide-taxes
option to reduce the tax. See Manage Invocation Taxes for details.
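A command-line sketch of this workaround (the project directory is assumed; the thread count is a placeholder for the target device capacity):

```shell
# Model batching-like techniques
advisor --collect=projection --batching --project-dir=./advi_results
# Or model the work split across a given number of parallel threads
advisor --collect=projection --threads=512 --project-dir=./advi_results
```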
Not profitable: Data Transfer Tax is greater than Computation Time and Memory Bandwidth Time
Time spent on transferring data to a target device is greater than
compute time
and
memory bandwidth time
. The resulting time estimated on a target platform with data transfer tax is greater than or equal to the time measured on a host platform.
Check the
Bounded By
and
Data Transfer Tax
columns in the
Estimated Bounded By
column group and the
Estimated Data Transfer with Reuse
column group. A large value means that this code region cannot benefit from offloading.
See Bounded by for details about metric interpretation.
If you still want to offload such regions, disable data transfer analysis with the
--data-transfer=off
option to use only the estimated execution time for speedup and profitability calculation.
This option disables data transfer analysis for all loops. You might get different performance modeling results for all loops.
If you already collected data transfer metrics, you can turn off modeling data transfer tax with the command line option
--hide-data-transfer-tax
.
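Both options as command-line sketches. This assumes --data-transfer=off applies to the characterization (trip counts) step and --hide-data-transfer-tax to the modeling step; paths and the application name are placeholders:

```shell
# Disable data transfer analysis during characterization
advisor --collect=tripcounts --data-transfer=off --project-dir=./advi_results -- ./myApplication
# Or keep the collected data transfer metrics but exclude the data transfer tax
advisor --collect=projection --hide-data-transfer-tax --project-dir=./advi_results
```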
Not profitable: Computation Time is high despite the full use of Target Platform capabilities
The code region uses full target platform capabilities, but time spent for compute operations is still high. As a result, the execution time estimated on a target platform is greater than or equal to the time measured on a host platform.
Check the value in the
Compute
column in the
Estimated Bounded By
column group. An unexpectedly high value means one of the following:
  • There is a problem with a programming model used.
  • Target GPU compute capabilities are lower than baseline CPU compute capabilities.
  • An internal
    Intel Advisor
    error occurred, caused by incorrect compute time estimation.
Not profitable:
Cache/Memory
Bandwidth Time is greater than other execution time components on Target Device
The time spent in
cache or memory bandwidth
takes a big part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on a host platform.
In the report, the
Cache/Memory
is replaced with a specific cache or memory level that prevents offloading, for example, L3 or LLC. See the
Throughput
column for details about the highest bandwidth time.
  1. Examine code region children to identify which part takes most of the time and prevents offloading.
  2. Optimize the part of your code that takes most of the time measured on a baseline platform and rerun the perspective.
Not profitable because of offload overhead (taxes)
Total time of offload taxes, which includes
Kernel Launch Tax
and
Data Transfer Tax
, takes a big part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on a host platform.
Examine the
Taxes with Reuse
column in the
Estimated Bounded by
group for the biggest and total time taxes paid for offloading the code region to a target platform. Expand the
Estimated Bounded by
group to see a full picture of time taxes paid for offloading the region to the target platform. A big value in any of the columns means that this code region cannot benefit from offloading because the cost of offloading is high.
If kernel launch tax is large and you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:
  • From GUI: Enable the
    Single Kernel Launch Tax
    option from the
    Analysis Workflow
    pane and rerun the Performance Modeling analysis.
  • From CLI: Use the
    --assume-hide-taxes
    option with the
    --collect=projection
    or
    analyze.py
See Manage Invocation Taxes for details.
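The CLI variant as a sketch (the project directory is a placeholder):

```shell
# Model hiding kernel launch taxes
advisor --collect=projection --assume-hide-taxes --project-dir=./advi_results
```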
Not profitable: Kernel Launch Tax is greater than Kernel Execution Time and Data Transfer Time
Time spent on launching a kernel is greater than execution time estimated on a target platform and estimated data transfer time. The resulting time estimated on the target platform with data transfer tax is greater than or equal to the time measured on a host platform.
Examine the
Bounded By
and
Kernel Launch Tax
columns in the
Estimated Bounded By
column group.
See Bounded by for details about metric interpretation.
High value in
Kernel Launch Tax
means that the
Intel Advisor
detects high call count for a potentially profitable code region and assumes that the kernel invocation tax is paid as many times as the kernel is launched. For this reason, it assumes that the code region cannot benefit from offloading.
If you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:
  • From GUI: Select the
    Single Kernel Launch Tax
    checkbox for the Performance Modeling analysis.
  • From CLI: Use the
    --assume-hide-taxes
    option with the
    --collect=projection
    action option or analyze.py.
For details, see Manage Invocation Taxes.
Not profitable: Atomic Throughput Time is greater than other execution time components on a Target Device
Atomic operations include loading, changing, and storing data to make sure it is not affected by other threads between the calls.
When modeling atomic operations, Intel Advisor assumes that
all
threads wait for each other, so
Atomic Throughput
time might be high and can be one of the main hotspots.
Go to the Analyzers Community forum for technical support and advice.
Not profitable: Instruction Latency is greater than Compute Time and Memory Bandwidth Time
Each memory read instruction produces a GPU thread stall. The stall is called a
memory latency
. Usually, execution of other threads can overlap it.
However, sometimes the amount of non-overlapped latency has a big impact on performance.
Intel Advisor
can estimate the non-overlapped memory latency and add it to the estimated kernel execution time.
Reducing thread occupancy can increase the amount of non-overlapped memory latency.
Examine the
Latency
column to see how much time is spent on load latency and the
Thread Occupancy
column to understand the reason. Low occupancy is a likely cause of high load latency. In this case, when offloading the code, increase the kernel parallelism or cover the latency with other instructions.
If you are sure that the load latency is overlapped with compute instructions in your code, you can enable latency hiding mode with the following:
  • From GUI:
    1. Go to
      Project Properties
      Performance Modeling
      .
    2. Enter
      --count-send-latency=first
      in the
      Other Parameters
      field.
    3. Re-run Performance Modeling.
  • From CLI: Use the
    --count-send-latency=first
    option with the
    --collect=projection
    action option or
    analyze.py
    .
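For example (sketch; the project directory is a placeholder):

```shell
# Enable latency hiding mode for estimated load latencies
advisor --collect=projection --count-send-latency=first --project-dir=./advi_results
```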

N/A - Part of Offload

This means that offloading a code region is less profitable than offloading its outer loop.
This is not an issue. The code region of interest is located inside of an offloaded loop.

Total Time Is Too Small for Reliable Modeling

This means the execution time of a code region or a whole loop nest is less than 0.02 seconds. In this case,
Intel Advisor
cannot estimate the speedup correctly or determine whether the code region is worth offloading, because its execution time is close to the sampling interval of the
Intel Advisor
.
Possible Solution
If you want to check the profitability of offloading code regions with total time less than 0.02 seconds:
  • From GUI:
    1. Go to
      Project Properties
      Performance Modeling
      .
    2. Enter the
      --loop-filter-threshold=0
      option in the
      Other parameters
      field to model such small offloads.
    3. Re-run Performance Modeling.
  • From CLI: Use the
    --loop-filter-threshold=0
    option with the
    --collect=projection
    or
    analyze.py
    .
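For example (sketch; the project directory is a placeholder):

```shell
# Model offloads even for regions with total time below 0.02 seconds
advisor --collect=projection --loop-filter-threshold=0 --project-dir=./advi_results
```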

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.