Investigate Non-Offloaded Code Regions
Cannot Be Modeled
Cause and Details
Cannot be modeled: Outside of Marked Region
Intel® Advisor cannot model performance for a code region because it is not marked up for analysis.
Make sure a code region satisfies all markup rules or use a different markup strategy:
Cannot be modeled: Not Executed
A code region is in the call tree, but Intel Advisor detected no calls to it for the dataset used during Survey.
This can happen if the execution time of the loop is very small and close to the Intel Advisor sampling interval. Time measurements for such loops can be significantly inaccurate. By default, the sampling interval for Survey is 0.01 seconds.
You can try to decrease the Intel Advisor sampling interval.
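The effect of the sampling interval on short loops can be illustrated with a rough sketch. It assumes a deliberately simplified sampler (not Intel Advisor's actual implementation) that fires at fixed multiples of the interval and credits one full interval to whatever code is running at that moment. The function name `sampled_time_us` and all example values are hypothetical; times are integer microseconds to avoid floating-point rounding.

```python
def sampled_time_us(start_us, duration_us, interval_us=10_000):
    """Time credited to a loop by a simplified periodic sampler.

    Assumption (illustration only): samples fire at t = 0, interval,
    2 * interval, ... and every sample that lands inside
    [start, start + duration) credits the loop with one full interval.
    The default of 10,000 us matches the 0.01 s Survey default.
    """
    first = -(-start_us // interval_us)                 # first sample index at or after start
    last = -(-(start_us + duration_us) // interval_us)  # first sample index at or after end
    return max(0, last - first) * interval_us


# An 8 ms loop that starts 1 ms after a sample is never observed at all.
missed = sampled_time_us(1_000, 8_000)
# The same loop starting at 5 ms is hit once and credited with 10 ms.
overcounted = sampled_time_us(5_000, 8_000)
# With a 1 ms interval, the measurement matches the real 8 ms.
finer = sampled_time_us(1_000, 8_000, interval_us=1_000)
```

In this toy model, the same 8 ms loop is reported as 0 ms or 10 ms depending on where it falls relative to the samples, while a smaller interval recovers the real duration.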
Cannot be modeled: Internal Error
Internal Error means incorrect data or lack of data because Intel Advisor encountered issues when collecting or processing data.
Cannot be modeled: System Module
This code region is a system function/loop.
This is not an issue. If this code region is inside an offload region or a runtime call, its execution time is added to the execution time of offloaded regions.
Cannot be modeled: No Execution Count
Intel Advisor detected no calls to a loop during the Trip Count step of the Characterization analysis, so no information about execution counts is available for this loop.
Check which loop is executed at this branch.
If you see a wrong loop, try re-running Offload Modeling to fix the metrics attribution problem.
Less or Equally Profitable Than Children/Parent Offload
Cause and Details
Less or equally profitable than children offloads
Offloading child loops/functions of this code region is more profitable than offloading the whole region with all its children. This means that the execution Time estimated on a target platform for the region of interest is greater than or equal to the sum of Time values estimated on a target platform for its child regions that are profitable for offloading.
The following reasons might prevent offloading: total execution time, taxes, trip counts, dependencies.
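The comparison described above can be written as a minimal sketch. It only restates the stated rule (region time versus the sum of child times) and is not Intel Advisor's actual model. The function name `children_more_profitable` and the sample values are hypothetical.

```python
def children_more_profitable(region_time, child_times):
    """Apply the rule from the text to estimated target-platform times.

    region_time: Time estimated on the target for the whole region.
    child_times: Time values estimated on the target for the child
                 regions that are profitable for offloading.
    Returns True when the whole region is less or equally profitable
    than offloading those children separately.
    """
    return region_time >= sum(child_times)


# Hypothetical estimates in seconds.
whole_is_slower = children_more_profitable(5.0, [2.0, 2.5])   # 5.0 >= 4.5
equal_case = children_more_profitable(4.5, [2.0, 2.5])        # 4.5 >= 4.5
whole_is_faster = children_more_profitable(4.0, [2.0, 2.5])   # 4.0 <  4.5
```

Note that equality counts as "less or equally profitable", which is why the check uses `>=` rather than `>`.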
Less or equally profitable than parent offload
Offloading a whole parent code region of the region of interest is more profitable than offloading any of its child regions separately. This means that the Time estimated on a target platform for the region of interest is greater than or equal to the Time estimated on the target platform for its parent region.
Offloading a child code region might be limited by high offload taxes.
If you assume the kernel execution should overlap offload taxes, use the --collect=projection action option or the analyze.py script. See Manage Invocation Taxes for details.
Model offloading for only specific code regions even if they are not profitable. See Enforce Offloading for Specific Loops for details.
Cause and Details
Not profitable: Parallel execution efficiency is limited due to Dependencies
Dependencies limit parallel execution and the code region cannot benefit from offloading to a target device. The estimated execution time after acceleration is greater than or equal to the original execution time.
Ignore assumed dependencies and model offloading for all or selected code regions:
For details, see Check How Dependencies Affect Modeling.
If you did not enable the Dependencies analysis when collecting data, run the analysis as follows to get detailed information about real dependencies in your code:
See Dependency Type metric description for details.
Not profitable: The Number of Loop Iterations is not enough to fully utilize Target Platform capabilities
The loop cannot benefit from offloading to a target platform as it has a low number of iterations.
In most cases, such code regions cannot benefit from offloading. If you assume that during code migration, the amount of parallel work grows and a loop is broken down into several chunks by a compiler or a programming model, use the following workaround:
If you enable batching, the kernel invocation tax might grow. You can use the --assume-hide-taxes option to reduce the tax. See Manage Invocation Taxes for details.
Not profitable: Data Transfer Tax is greater than Computation Time and Memory Bandwidth Time
Time spent on transferring data to a target device is greater than computation time and memory bandwidth time. The resulting time estimated on a target platform with data transfer tax is greater than or equal to the time measured on a host platform.
Check the Data Transfer Tax columns in the Estimated Bounded By column group and the Estimated Data Transfer with Reuse column group. A large value means that this code region cannot benefit from offloading.
If you still want to offload such regions, disable data transfer analysis with the --data-transfer=off option to use only estimated execution time for speedup and profitability calculation.
This option disables data transfer analysis for all loops. You might get different performance modeling results for all loops.
If you already collected data transfer metrics, you can turn off modeling data transfer tax with the command line option
Not profitable: Computation Time is high despite the full use of Target Platform capabilities
The code region uses full target platform capabilities, but time spent for compute operations is still high. As a result, the execution time estimated on a target platform is greater than or equal to the time measured on a host platform.
Check the value in the Compute column in the Estimated Bounded By column group. An unexpectedly high value means one of the following:
Cache/Memory Bandwidth Time is greater than other execution time components on Target Device
The time spent in cache or memory bandwidth takes up a large part of the time estimated on a target platform. As a result, the estimated time is greater than or equal to the time measured on a host platform.
In the report, Cache/Memory is replaced with the specific cache or memory level that prevents offloading, for example, L3 or LLC. See the Throughput column for details about the highest bandwidth time.
Not profitable because of offload overhead (taxes)
The total time of offload taxes, including Kernel Launch Tax and Data Transfer Tax, takes up a large part of the time estimated on a target platform. As a result, the estimated time is greater than or equal to the time measured on a host platform.
Check the Taxes with Reuse column in the Estimated Bounded By group for the biggest and total time taxes paid for offloading the code region to a target platform. Expand the Estimated Bounded By group to see the full picture of time taxes paid for offloading the region to the target platform. A big value in any of the columns means that this code region cannot benefit from offloading because the cost of offloading is high.
If kernel launch tax is large and you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:
See Manage Invocation Taxes for details.
Not profitable: Kernel Launch Tax is greater than Kernel Execution Time and Data Transfer Time
Time spent on launching a kernel is greater than execution time estimated on a target platform and estimated data transfer time. The resulting time estimated on the target platform with data transfer tax is greater than or equal to the time measured on a host platform.
Check the Kernel Launch Tax columns in the Estimated Bounded By column group.
A high value in Kernel Launch Tax means that Intel Advisor detects a high call count for a potentially profitable code region and assumes that the kernel invocation tax is paid as many times as the kernel is launched. For this reason, it assumes that the code region cannot benefit from offloading.
If you assume the kernel execution should overlap the launch tax, model hiding the launch taxes as follows:
For details, see Manage Invocation Taxes.
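The reasoning behind this message can be sketched as follows. The sketch assumes a fixed, hypothetical per-launch tax and simply multiplies it by the call count, as the text describes. It is not Intel Advisor's actual tax model. The function name `launch_tax_dominates` and all numbers are illustrative; times are integer microseconds.

```python
def launch_tax_dominates(call_count, launch_tax_per_call_us,
                         exec_time_us, data_transfer_time_us):
    """Check whether the total launch tax exceeds both other components.

    Assumption (illustration only): the kernel invocation tax is paid
    once per call, so the total tax is call_count * per-call tax.
    Returns (total_tax_us, dominates).
    """
    total_tax_us = call_count * launch_tax_per_call_us
    dominates = (total_tax_us > exec_time_us
                 and total_tax_us > data_transfer_time_us)
    return total_tax_us, dominates


# A region called 100,000 times with a hypothetical 5 us tax per launch.
many_calls = launch_tax_dominates(100_000, 5, 200_000, 50_000)
# The same region called only 1,000 times.
few_calls = launch_tax_dominates(1_000, 5, 200_000, 50_000)
```

With the same per-launch cost, only the call count changes the outcome, which is why a high call count alone can make an otherwise profitable region look unprofitable.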
Not profitable: Atomic Throughput Time is greater than other execution time components on a Target Device
Atomic operations include loading, changing, and storing data to make sure it is not affected by other threads between the calls.
When modeling atomic operations, Intel Advisor assumes that all threads wait for each other, so Atomic Throughput time might be high and can be one of the main hotspots.
Not profitable: Instruction Latency is greater than Compute Time and Memory Bandwidth Time
Each memory read instruction produces a GPU thread stall. The stall is called memory latency. Usually, execution of other threads can overlap it.
However, sometimes the amount of non-overlapped latency has a big impact on performance.
Intel Advisor can estimate the non-overlapped memory latency and add it to the kernel estimated execution time.
Reducing thread occupancy can increase the amount of non-overlapped memory latency.
Check the Latency column to see how much time is spent on load latency and the Thread Occupancy column to understand the reason for this. Low occupancy means that it is the reason for the high load latency. In this case, when offloading the code, increase the kernel parallelism or cover latency with other instructions.
If you are sure that the load latency is overlapped with compute instructions in your code, you can enable latency hiding mode with the following:
N/A - Part of Offload
Total Time Is Too Small for Reliable Modeling
- From GUI:
- Go to.
- Enter the --loop-filter-threshold=0 option in the Other parameters field to model such small offloads.
- Re-run Performance Modeling.
- From CLI: Use the --loop-filter-threshold=0 option with --collect=projection or analyze.py.