Investigate Non-Offloaded Code Regions
The modeling step analyzes the profitability of offloading code regions to a target device. Some regions might not be profitable for offloading, and some cannot be modeled.
If you view a result in the Intel® Advisor GUI: to see why your code region of interest is reported as not recommended for offloading, select a loop in the Code Regions pane and open the Details tab in the right-side pane. It shows detailed loop information, including the reason why the loop is not recommended for offloading.
By default, the report shows all code regions. You can apply filters to see only regions recommended or not recommended for offloading: open the filter drop-down list and select the desired option.

If you view the result in an Offload Modeling HTML report: go to the Non-Offloaded Regions tab and examine the Why Not Offloaded column in the Offload Information group to see the reason why a code region is not recommended for offloading to the selected target platform.
For each region not recommended for offloading, you can force offload modeling. See Enforce Offloading for Specific Loops.
Cannot Be Modeled
Message | Cause and Details | Solution |
---|---|---|
Cannot be modeled: Outside of Marked Region | Intel® Advisor cannot model performance for a code region because it is not marked up for analysis. | Make sure the code region satisfies all markup rules or use a different markup strategy. |
Cannot be modeled: Not Executed | The code region is in the call tree, but Intel Advisor detected no calls to it for the dataset used during the Survey analysis. This can happen if the execution time of the loop is very small and close to the Intel Advisor sampling interval (0.01 seconds by default for the Survey analysis). Such loops can have significant inaccuracies in time measurement. | Try decreasing the Intel Advisor sampling interval. |
Cannot be modeled: Internal Error | Internal Error means incorrect data or a lack of data because Intel Advisor encountered issues when collecting or processing data. | Try re-running the Offload Modeling perspective to fix the metrics attribution problem. If this does not help, use the Analyzers Community forum for technical support. |
Cannot be modeled: System Module | This code region is a system function/loop. | This is not an issue. If this code region is inside an offload region or is a runtime call, its execution time is added to the execution time of offloaded regions. |
Cannot be modeled: No Execution Count | Intel Advisor detected no calls to the loop during the Trip Counts step of the Characterization analysis, so no information about execution counts is available for this loop. | Check which loop is executed at this branch. If you see a wrong loop, try re-running Offload Modeling to fix the metrics attribution problem. |
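For the Not Executed case, you can decrease the sampling interval from the command line before re-running the Survey analysis. A minimal sketch, assuming the `advisor` CLI is on your PATH, that your Advisor version accepts the `--interval` option (value in milliseconds, 10 ms by default), and that `./advi_results` and `./myapp` are placeholder names for your project directory and application:

```shell
# Re-run the Survey analysis with a finer sampling interval so that
# short-running loops collect enough samples for reliable time measurement.
# --interval is in milliseconds; 10 ms (0.01 s) is the default.
advisor --collect=survey --interval=1 \
        --project-dir=./advi_results -- ./myapp
```

After re-collecting Survey data with the smaller interval, re-run the rest of the Offload Modeling flow so the projection uses the refined timings.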
Less or Equally Profitable Than Children/Parent Offload
This message is not an issue. It means that Intel Advisor has found a more profitable code region to offload. If you still want to see offload estimations for the original code region, use the solutions described in the table below.
Message | Cause and Details | Solution |
---|---|---|
Less or equally profitable than children offloads | Offloading the child loops/functions of this code region is more profitable than offloading the whole region with all its children: the execution time estimated on a target platform for the region of interest is greater than or equal to the sum of the time values estimated for its child regions that are profitable for offloading. Total execution time, taxes, trip counts, or dependencies might prevent offloading the whole region. | Model offloading for specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details. |
Less or equally profitable than parent offload | Offloading the whole parent code region of the region of interest is more profitable than offloading any of its child regions separately: the time estimated on a target platform for the region of interest is greater than or equal to the time estimated on the target platform for its parent region. Offloading a child code region might be limited by high offload taxes. | Solution 1: If you assume the kernel execution should overlap the offload taxes, use the --assume-hide-taxes option with the --collect=projection action option or the analyze.py script. See Manage Invocation Taxes for details. Solution 2: Model offloading only for specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details. |
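If the parent/child profitability decision is dominated by invocation taxes, you can model hiding them when you re-run the projection. A sketch under the assumption that you already collected Survey and Characterization data, and that `./advi_results` is a placeholder for your project directory:

```shell
# Re-run only the performance modeling step, assuming kernel execution
# overlaps the kernel invocation (launch) taxes.
advisor --collect=projection --assume-hide-taxes \
        --project-dir=./advi_results
```

Because only the modeling step re-runs, no new profiling of the application is needed.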
Not Profitable
Message | Cause and Details | Solution |
---|---|---|
Not profitable: Parallel execution efficiency is limited due to Dependencies | Dependencies limit parallel execution, so the code region cannot benefit from offloading to a target device. The estimated execution time after acceleration is greater than or equal to the original execution time. | Solution 1: Ignore assumed dependencies and model offloading for all or selected code regions. For details, see Check How Dependencies Affect Modeling. Solution 2: If you did not enable the Dependencies analysis when collecting data, run it to get detailed information about real dependencies in your code. See the Dependency Type metric description for details. |
Not profitable: The Number of Loop Iterations is not enough to fully utilize Target Platform capabilities | The loop cannot benefit from offloading to a target platform because it has a low number of iterations. | In most cases, such code regions cannot benefit from offloading. If you assume that during code migration the amount of parallel work grows and the loop is broken down into several chunks by a compiler or a programming model, model batching as a workaround. If you enable batching, the kernel invocation tax might grow; use the --assume-hide-taxes option to reduce the tax. See Manage Invocation Taxes for details. |
Not profitable: Data Transfer Tax is greater than Computation Time and Memory Bandwidth Time | Time spent on transferring data to a target device is greater than the compute time and the memory bandwidth time. The resulting time estimated on a target platform with the data transfer tax is greater than or equal to the time measured on the host platform. | Check the Bounded By and Data Transfer Tax columns in the Estimated Bounded By column group and the Estimated Data Transfer with Reuse column group. A large value means that this code region cannot benefit from offloading. See Bounded By for details about metric interpretation. If you still want to offload such regions, disable the data transfer analysis with the --data-transfer=off option to use only the estimated execution time for the speedup and profitability calculation. This option disables data transfer analysis for all loops, so you might get different performance modeling results for all loops. If you already collected data transfer metrics, you can turn off modeling the data transfer tax with the --hide-data-transfer-tax command line option. |
Not profitable: Computation Time is high despite the full use of Target Platform capabilities | The code region uses the full target platform capabilities, but the time spent on compute operations is still high. As a result, the execution time estimated on a target platform is greater than or equal to the time measured on the host platform. | Check the value in the Compute column in the Estimated Bound-by column group. An unexpectedly high value means one of the following: |
Not profitable: Cache/Memory Bandwidth Time is greater than other execution time components on Target Device | The time spent in cache or memory bandwidth takes a big part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on the host platform. In the report, Cache/Memory is replaced with the specific cache or memory level that prevents offloading, for example, L3 or LLC. See the Throughput column for details about the highest bandwidth time. | |
Not profitable because of offload overhead (taxes) | The total time of offload taxes, which include the Kernel Launch Tax and the Data Transfer Tax, takes a big part of the time estimated on a target platform. As a result, it is greater than or equal to the time measured on the host platform. | Examine the Taxes with Reuse column in the Estimated Bounded By group for the biggest and total time taxes paid for offloading the code region to a target platform. Expand the Estimated Bounded By group to see a full picture of the time taxes. A big value in any of the columns means that this code region cannot benefit from offloading because the cost of offloading is high. If the kernel launch tax is large and you assume the kernel execution should overlap the launch tax, model hiding the launch taxes. See Manage Invocation Taxes for details. |
Not profitable: Kernel Launch Tax is greater than Kernel Execution Time and Data Transfer Time | The time spent on launching a kernel is greater than the execution time estimated on a target platform and the estimated data transfer time. The resulting time estimated on the target platform with the data transfer tax is greater than or equal to the time measured on the host platform. | Examine the Bounded By and Kernel Launch Tax columns in the Estimated Bounded By column group. See Bounded By for details about metric interpretation. A high value in Kernel Launch Tax means that Intel Advisor detects a high call count for a potentially profitable code region and assumes that the kernel invocation tax is paid as many times as the kernel is launched. For this reason, it assumes that the code region cannot benefit from offloading. If you assume the kernel execution should overlap the launch tax, model hiding the launch taxes. For details, see Manage Invocation Taxes. |
Not profitable: Atomic Throughput Time is greater than other execution time components on a Target Device | Atomic operations include loading, changing, and storing data to make sure the data is not affected by other threads between the calls. When modeling atomic operations, Intel Advisor assumes that all threads wait for each other, so the Atomic Throughput time might be high and can be one of the main hotspots. | Go to the Analyzers Community forum for technical support and advice. |
Not profitable: Instruction Latency is greater than Compute Time and Memory Bandwidth Time | Each memory read instruction produces a GPU thread stall called memory latency. Usually, execution of other threads can overlap it, but sometimes the amount of non-overlapped latency has a big impact on performance. Intel Advisor can estimate the non-overlapped memory latency and add it to the estimated kernel execution time. Reducing thread occupancy can increase the amount of non-overlapped memory latency. | Examine the Latency column to see how much time is spent on load latency and the Thread Occupancy column to understand the reason. Low occupancy means that it causes the high load latency; in this case, when offloading the code, increase the kernel parallelism or cover the latency with other instructions. If you are sure that the load latency is overlapped with compute instructions in your code, enable the latency hiding mode. |
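The two data-transfer-related workarounds from the table map onto two CLI paths. A hedged sketch: the `./advi_results` project directory and `./myapp` application are placeholders, and the exact collection action that accepts `--data-transfer=off` (shown here as the Characterization trip-counts step) may differ between Advisor versions, so check your version's CLI reference:

```shell
# Option A: skip data transfer analysis entirely during characterization,
# so only estimated execution time feeds the speedup/profitability model.
advisor --collect=tripcounts --data-transfer=off \
        --project-dir=./advi_results -- ./myapp

# Option B: keep already-collected data transfer metrics, but exclude
# the data transfer tax from the performance model.
advisor --collect=projection --hide-data-transfer-tax \
        --project-dir=./advi_results
```

Option A changes what is collected (and affects all loops); Option B only changes how the existing data is modeled, so it is the cheaper choice when metrics are already available.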
N/A - Part of Offload
This is not an issue. The code region of interest is located inside an offloaded loop: offloading this region separately is less profitable than offloading its outer loop.
Total Time Is Too Small for Reliable Modeling
This means the execution time of a code region or a whole loop nest is less than 0.02 seconds. In this case, Intel Advisor cannot estimate the speedup correctly or say whether it is worth offloading the code region, because its execution time is close to the Intel Advisor sampling interval.
Possible Solution
If you want to check the profitability of offloading code regions with a total time less than 0.02 seconds:
- From the GUI:
  - Go to .
  - Enter the --loop-filter-threshold=0 option in the Other parameters field to model such small offloads.
  - Re-run the Performance Modeling analysis.
- From the CLI: use the --loop-filter-threshold=0 option with the --collect=projection action or the analyze.py script.
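The CLI variant can be sketched as follows, assuming Survey and Characterization data already exist in a placeholder project directory `./advi_results`:

```shell
# Re-run the modeling step with the loop time filter disabled so that
# regions with total time below 0.02 s are included in the model.
advisor --collect=projection --loop-filter-threshold=0 \
        --project-dir=./advi_results
```

Note that estimates for such short regions remain less reliable, since their measured times are close to the sampling interval.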