Intel® Advisor User Guide

ID 766448
Date 11/07/2023
Public



Investigate Non-Offloaded Code Regions

The modeling step analyzes code region profitability for offloading to a target device. Some regions might not be profitable for offloading, and some cannot be modeled at all.

If you view a result in the Intel® Advisor GUI: To see why a code region is reported as not recommended for offloading, select a loop in the Code Regions pane and open the Details tab in the right-side pane. The tab shows detailed loop information, including the reason why the loop is not recommended for offloading.

By default, the report shows all code regions. You can apply filters to see only regions recommended or not recommended for offloading: open the drop-down list and select the desired filter option.

If you view the result in an Offload Modeling HTML report: Go to the Non-Offloaded Regions tab and examine the Why Not Offloaded column in the Offload Information group to see the reason why a code region is not recommended for offloading to the selected target platform.

TIP:
For each region not recommended for offloading, you can force offload modeling. See Enforce Offloading for Specific Loops.

Cannot Be Modeled

Message

Cause and Details

Solution

Cannot be modeled: Outside of Marked Region

Intel® Advisor cannot model performance for a code region because it is not marked up for analysis.

Make sure the code region satisfies all markup rules, or use a different markup strategy:

  • It is not a system module or a system function.
  • It has instruction mixes.
  • It is executed.
  • Its execution time is not less than 0.02 seconds.

Cannot be modeled: Not Executed

A code region is in the call tree, but Intel Advisor detected no calls to it for the dataset used during the Survey analysis.

This can happen if execution time of the loop is very small and close to the sampling interval of the Intel Advisor. Such loops can have significant inaccuracies in time measurement. By default, the sampling interval for Survey is 0.01 seconds.

You can try to decrease the sampling interval of the Intel Advisor:

  • From GUI:
    1. Go to Project Properties > Survey Hotspots Analysis > Advanced.
    2. Set the Sampling Interval to less than 10 ms.
    3. Re-run Offload Modeling.
  • From CLI: Use the --interval option when running --collect=survey.
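For example, the CLI flow might look like the following sketch (the project directory ./advi_results and application ./myapp are placeholders; the --interval value is in milliseconds):

```shell
# Re-run Survey with a 5 ms sampling interval (default is 10 ms).
advisor --collect=survey --interval=5 --project-dir=./advi_results -- ./myapp

# Re-run the modeling step on the refreshed Survey data
# (other Offload Modeling steps, such as Characterization, are omitted here).
advisor --collect=projection --project-dir=./advi_results
```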

Cannot be modeled: Internal Error

Internal Error means incorrect data or lack of data because the Intel Advisor encountered issues when collecting or processing data.

Try to re-run the Offload Modeling perspective to fix the metrics attribution problem. If this does not help, use the Analyzers Community forum for technical support.

Cannot be modeled: System Module

This code region is a system function/loop.

This is not an issue. If this code region is inside an offload region or a runtime call, its execution time is added to the execution time of the offloaded regions.

Cannot be modeled: No Execution Count

Intel Advisor detected no calls to the loop during the Trip Counts step of the Characterization analysis, so no information about execution counts is available for this loop.

Check what loop is executed at this branch.

If a wrong loop is shown, try re-running the Offload Modeling perspective to fix the metrics attribution problem.

Less or Equally Profitable Than Children/Parent Offload

This message is not an issue. It means that Intel Advisor has found a more profitable code region to offload. If you still want to see offload estimations for the original code region, use the solutions described in the table below.

Message

Cause and Details

Solution

Less or equally profitable than children offloads

Offloading child loops/functions of this code region is more profitable than offloading the whole region with all its children. This means that the execution time estimated on a target platform for the region of interest is greater than or equal to the sum of the times estimated on the target platform for its child regions that are profitable for offloading.

The following factors might prevent offloading: total execution time, taxes, trip counts, or dependencies.

Model offloading for specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details.

Less or equally profitable than parent offload

Offloading the whole parent code region of the region of interest is more profitable than offloading any of its child regions separately. This means that the time estimated on a target platform for the region of interest is greater than or equal to the time estimated on the target platform for its parent region.

Offloading a child code region might be limited by high offload taxes.

Solution 1

If you assume the kernel execution should overlap offload taxes, use the --assume-hide-taxes option with --collect=projection action option or the analyze.py script. See Manage Invocation Taxes for details.

Solution 2

Model offloading for only specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details.

Not Profitable

Message

Cause and Details

Solution

Not profitable: Parallel execution efficiency is limited due to Dependencies

Dependencies limit parallel execution and the code region cannot benefit from offloading to a target device. The estimated execution time after acceleration is greater than or equal to the original execution time.

Solution 1

Ignore assumed dependencies and model offloading for all or selected code regions:

  • From GUI:
    1. Go to Project Properties > Performance Modeling.
    2. Enter one of the options in the Other Parameters field:

      • --no-assume-dependencies to assume that all code regions without dependency information are parallel
      • --set-parallel=[<loop-IDs/source-locations>] to ignore dependencies for specified code regions
    3. Re-run Performance Modeling.
  • From CLI: When running --collect=projection or analyze.py, use one of the following:
    • --no-assume-dependencies to ignore dependencies for all code regions
    • --set-parallel=[<loop-IDs/source-locations>] to ignore dependencies for specified code regions

For details, see Check How Dependencies Affect Modeling.
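As a sketch, the corresponding CLI commands might look like this (the project directory ./advi_results is a placeholder, and main.cpp:245 is a hypothetical source location):

```shell
# Option A: treat all code regions without dependency information as parallel.
advisor --collect=projection --no-assume-dependencies --project-dir=./advi_results

# Option B: ignore dependencies only for specific regions,
# identified by loop ID or source location (quoted to avoid shell globbing).
advisor --collect=projection "--set-parallel=[main.cpp:245]" --project-dir=./advi_results
```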

Solution 2

If you did not enable the Dependencies analysis when collecting data, run the analysis as follows to get detailed information about real dependencies in your code:

  • From GUI: Enable the Dependencies and Performance Modeling analyses from the Analysis Workflow pane and re-run the perspective.
  • From CLI: Run the Dependencies analysis with --collect=dependencies and re-run the Performance Modeling with --collect=projection or analyze.py.

See Dependency Type metric description for details.
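A minimal CLI sketch of this flow (the project directory ./advi_results and application ./myapp are placeholders):

```shell
# Collect real dependency data for the marked-up loops.
advisor --collect=dependencies --project-dir=./advi_results -- ./myapp

# Re-run Performance Modeling with the new dependency information.
advisor --collect=projection --project-dir=./advi_results
```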

Not profitable: The Number of Loop Iterations is not enough to fully utilize Target Platform capabilities

The loop cannot benefit from offloading to a target platform as it has a low number of iterations.

In most cases, such code regions cannot benefit from offloading. If you assume that during code migration the amount of parallel work grows and the loop is broken down into several chunks by a compiler or a programming model, use the following workaround:

  • From GUI:
    1. Go to Project Properties > Performance Modeling.
    2. Enter --batching or --threads=<target-threads> in the Other Parameters field. <target-threads> is the number of parallel threads equal to the target device capacity.
    3. Re-run Performance Modeling.
  • From CLI: When running --collect=projection or analyze.py, use one of the following:
    • --batching to model batching-like techniques
    • --threads=<target-threads>, where <target-threads> is the number of parallel threads equal to the target device capacity

If you enable batching, the kernel invocation tax might grow. You can use the --assume-hide-taxes option to reduce the tax. See Manage Invocation Taxes for details.
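For example (the project directory is a placeholder, and 512 is an arbitrary illustrative thread count, not a recommendation):

```shell
# Option A: model batching-like techniques.
advisor --collect=projection --batching --project-dir=./advi_results

# Option B: model the loop as if split across enough parallel threads
# to saturate the target device.
advisor --collect=projection --threads=512 --project-dir=./advi_results
```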

Not profitable: Data Transfer Tax is greater than Computation Time and Memory Bandwidth Time

Time spent on transferring data to a target device is greater than compute time and memory bandwidth time. The resulting time estimated on a target platform with data transfer tax is greater than or equal to the time measured on a host platform.

Check the Bounded By and Data Transfer Tax columns in the Estimated Bounded By column group and the Estimated Data Transfer with Reuse column group. A large value means that this code region cannot benefit from offloading.

See Bounded By for details about metric interpretation.

If you still want to offload such regions, disable data transfer analysis with the --data-transfer=off option to use only the estimated execution time for speedup and profitability calculation.

NOTE:
This option disables data transfer analysis for all loops, so performance modeling results might change for all loops.

If you already collected data transfer metrics, you can turn off modeling data transfer tax with the command line option --hide-data-transfer-tax.
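As a command-line sketch (placeholder paths; in some Intel Advisor versions --data-transfer is an option of the Characterization step, shown here with --collect=tripcounts):

```shell
# Option A: collect Characterization data with data transfer analysis off,
# so the model uses only estimated execution time.
advisor --collect=tripcounts --flop --data-transfer=off --project-dir=./advi_results -- ./myapp
advisor --collect=projection --project-dir=./advi_results

# Option B: data transfer metrics are already collected; exclude only the
# modeled data transfer tax.
advisor --collect=projection --hide-data-transfer-tax --project-dir=./advi_results
```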

Not profitable: Computation Time is high despite the full use of Target Platform capabilities

The code region uses full target platform capabilities, but time spent for compute operations is still high. As a result, the execution time estimated on a target platform is greater than or equal to the time measured on a host platform.

Check the value in the Compute column in the Estimated Bounded By column group. An unexpectedly high value means one of the following:

  • There is a problem with a programming model used.
  • Target GPU compute capabilities are lower than baseline CPU compute capabilities.
  • An internal Intel Advisor error occurred, caused by incorrect compute time estimation.

Not profitable: Cache/Memory Bandwidth Time is greater than other execution time components on Target Device

The time spent on cache or memory bandwidth takes up a large part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on a host platform.

In the report, Cache/Memory is replaced with the specific cache or memory level that prevents offloading, for example, L3 or LLC. See the Throughput column for details about the highest bandwidth time.

  1. Examine code region children to identify which part takes most of the time and prevents offloading.
  2. Optimize the part of your code that takes most of the time measured on a baseline platform and rerun the perspective.

Not profitable because of offload overhead (taxes)

The total time of offload taxes, which include the kernel launch tax and the data transfer tax, takes up a large part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on a host platform.

Examine the Taxes with Reuse column in the Estimated Bounded By group for the biggest and total time taxes paid for offloading the code region to a target platform. Expand the Estimated Bounded By group to see the full picture of time taxes paid for offloading the region to the target platform. A large value in any of the columns means that this code region cannot benefit from offloading because the cost of offloading is high.

If kernel launch tax is large and you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:

  • From GUI: Enable the Single Kernel Launch Tax option from the Analysis Workflow pane and rerun the Performance Modeling analysis.
  • From CLI: Use the --assume-hide-taxes option with --collect=projection or analyze.py.

See Manage Invocation Taxes for details.

Not profitable: Kernel Launch Tax is greater than Kernel Execution Time and Data Transfer Time

Time spent on launching a kernel is greater than execution time estimated on a target platform and estimated data transfer time. The resulting time estimated on the target platform with data transfer tax is greater than or equal to the time measured on a host platform.

Examine the Bounded By and Kernel Launch Tax columns in the Estimated Bounded By column group.

See Bounded By for details about metric interpretation.

A high value in the Kernel Launch Tax column means that Intel Advisor detected a high call count for a potentially profitable code region and assumes that the kernel invocation tax is paid as many times as the kernel is launched. For this reason, it assumes that the code region cannot benefit from offloading.

If you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:

  • From GUI: Select the Single Kernel Launch Tax checkbox for the Performance Modeling analysis.
  • From CLI: Use the --assume-hide-taxes option with the --collect=projection action option or analyze.py.

For details, see Manage Invocation Taxes.
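A minimal CLI sketch (the project directory ./advi_results is a placeholder):

```shell
# Model the kernel launch tax as if paid only once, assuming later
# invocations overlap with kernel execution.
advisor --collect=projection --assume-hide-taxes --project-dir=./advi_results
```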

Not profitable: Atomic Throughput Time is greater than other execution time components on a Target Device

Atomic operations include loading, changing, and storing data to make sure it is not affected by other threads between the calls.

When modeling atomic operations, Intel Advisor assumes that all threads wait for each other, so Atomic Throughput time might be high and can be one of the main hotspots.

Go to the Analyzers Community forum for technical support and advice.

Not profitable: Instruction Latency is greater than Compute Time and Memory Bandwidth Time

Each memory read instruction produces a GPU thread stall, known as memory latency. Usually, execution of other threads can overlap it.

However, sometimes the amount of non-overlapped latency has a big impact on performance. Intel Advisor can estimate the non-overlapped memory latency and add it to the kernel estimated execution time.

Reducing thread occupancy can increase the amount of non-overlapped memory latency.

Examine the Latency column to see how much time is spent on load latency, and check the Thread Occupancy column to understand the reason. Low occupancy means that it is the cause of the high load latency. In this case, when offloading the code, increase the kernel parallelism or cover the latency with other instructions.

If you are sure that the load latency is overlapped with compute instructions in your code, you can enable latency hiding mode with the following:

  • From GUI:
    1. Go to Project Properties > Performance Modeling.
    2. Enter --count-send-latency=first in the Other Parameters field.
    3. Re-run Performance Modeling.
  • From CLI: Use the --count-send-latency=first option with the --collect=projection action option or analyze.py.
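For example, the CLI invocation might look like this sketch (the project directory ./advi_results is a placeholder):

```shell
# Count send (load) latency only for the first instruction in a sequence,
# assuming the remaining latency is hidden by compute instructions.
advisor --collect=projection --count-send-latency=first --project-dir=./advi_results
```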