User Guide


Examine Data Transfers for Modeled Regions

Accuracy Level


Enabled Analyses

Survey + Characterization (Trip Counts and FLOP with cache simulation and light data transfer simulation) + Performance Modeling with no assumed dependencies

Result Interpretation

After running the Offload Modeling perspective with this accuracy level, you get an extended Offload Modeling report, which provides information about memory and cache usage and the taxes of your offloaded application. In addition to the basic data, the result includes:
  • More accurate estimations of traffic and time for all cache and memory levels.
  • Measured data transfer and estimated data transfer between host and device memory.
  • Total data for the loop/function from different callees.
When profiling a GPU application with data transfer simulation mode, you will get memory traffic estimation only for CPU code.
The Offload Modeling perspective assumes a loop is parallel if its dependency type is unknown. This means that there is no information about the loop from a compiler, or the loop is not explicitly marked as parallel, for example, with a programming model (OpenMP*, Data Parallel C++, Intel® oneAPI Threading Building Blocks).
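To illustrate what "explicitly marked as parallel" can look like in source code, here is a minimal C++ sketch. It assumes an OpenMP-capable compiler (without OpenMP enabled, the pragma is ignored and the loop runs serially). The function names `scale_add` and `prefix_sum` are hypothetical and are not part of Intel Advisor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] = a * x[i] + y[i].
// The OpenMP pragma explicitly marks the loop as parallel:
// each iteration touches only its own element.
void scale_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical counter-example: a running sum.
// Iteration i reads the value written by iteration i - 1, so the loop
// carries a dependency and is intentionally left unmarked.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

If a loop like `prefix_sum` has an unknown dependency type in the report, treat its estimated speedup with caution, because the parallel assumption described above does not hold for it.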
If you already have a report generated at a lower accuracy level, all offload recommendations, metrics, and speedup estimates are updated to be more precise based on the new data.
This topic describes data as it is shown in the Offload Modeling report in the Intel Advisor GUI and in an interactive HTML report.
In the Accelerated Regions tab of the Offload Modeling report, review the metrics about memory usage and data transfers.
Example of an Accelerated Regions report with data transfer and tax estimations (Offload Modeling perspective)
  • In the Code Regions metrics table:
    • In the Estimated Bounded By column, review how much time is spent transferring data (data transfer tax). In the Taxes with Reuse column, see the biggest and total time taxes paid for offloading a code region to a target platform. Expand the Estimated Bounded By group to see a full picture of all time taxes paid for offloading the region to the target platform.
    • In the Estimated Data Transfer with Reuse column, review how much data is transferred per kernel in each direction (from host to device and from device to host). Expand the column to see data per memory level.
    • In the Memory Estimations column, see how well your application uses resources of all memory levels. Expand the group to see more detailed and accurate metrics for different memory levels.
      Examine Estimated Bounded By, Estimated Data Transfer with Reuse, Memory Estimations columns to learn about per-kernel data transfers
  • Select a code region from the table and review the details about data transferred between host and device memory in the Data Transfer Estimations pane:
    • In the Transferred Data & Tax histogram, see the distribution of data transferred between the host and target devices in each direction.
    • See hints about optimizing data transfers in the selected code region.
      Examine data transferred in each direction and data transfer hints in the Data Transfer Estimations pane
  • In the Recommendations tab, get guidance for offloading your code to a target device and optimizing it so that your code benefits the most. If the code region has room for optimization or underutilizes the capacity of the target device, Intel Advisor provides hints and code snippets that might help you improve the code further.
For details about metrics reported, see Accelerator Metrics.
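As a point of reference for the two transfer directions mentioned above, the following C++ sketch shows how OpenMP `map` clauses express them in source code. This is an illustration only, not Intel Advisor output. The function name `vector_add_offload` is hypothetical, and the sketch assumes a compiler with OpenMP offload support (without it, the pragma is ignored and the loop runs on the host).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical kernel: c[i] = a[i] + b[i], offloaded with OpenMP.
std::vector<float> vector_add_offload(const std::vector<float>& a,
                                      const std::vector<float>& b) {
    assert(a.size() == b.size());
    // Guard the narrowing cast used for the loop bound below.
    assert(a.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int n = static_cast<int>(a.size());
    std::vector<float> c(a.size(), 0.0f);
    const float* pa = a.data();
    const float* pb = b.data();
    float* pc = c.data();

    // map(to: ...)   copies data from host to device when the region starts.
    // map(from: ...) copies data from device to host when the region ends.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i) {
        pc[i] = pa[i] + pb[i];
    }
    return c;
}
```

On a device with separate memory, a region like this moves the two input arrays in the host-to-device direction and the one result array in the device-to-host direction, which is the kind of per-direction split the data transfer columns report.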

Next Steps

To learn more about data transfers estimated between host and target device for your application, run Offload Modeling with one of the following properties:
  • Set the data transfer simulation under the characterization analysis to
    and run the perspective. The result should have the Data Transfer Estimations pane extended with new data that reports information about memory objects in each code region.
    The Offloaded Objects pane shows a list of memory objects, with data about each object aggregated across different instances of one region.
    Examine detected memory objects in the Offloaded Objects pane
    The memory object size histogram in the Analytics pane shows the number of memory objects that the selected region accessed, distributed by size.
    Examine memory object size histogram in the Analytics pane
  • Set the data transfer simulation under the characterization analysis to
    and enable the Data Reuse Analysis checkbox under the Performance Modeling analysis. With data reuse analysis, Intel Advisor detects groups of parallel code regions that can reuse memory objects transferred to a target GPU device. Such memory objects can be transferred to the GPU only once and reused, which can improve data transfer efficiency.
    The result should have data transfer metrics in the Code Regions pane estimated with and without data reuse for each code region. Examine the metrics in the Estimated Bounded By and Estimated Data Transfer with Reuse columns to check if a code region can benefit from applying data reuse.
    For code regions that can benefit from data reuse, you should see Apply Data Reuse guidance in the Recommendations tab. The guidance shows the data transfer estimated with and without data reuse and the performance gain from applying data reuse. It also explains how you can apply the data reuse technique to your code.
    Examine the data reuse recommendation in the Recommendations tab
  • If you think that the estimated speedup is enough and the application is ready to be offloaded, rewrite your code to offload profitable code regions to a target platform and measure the performance of GPU kernels with the GPU Roofline Insights perspective.
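The data reuse idea described above (transfer a memory object to the GPU once and let several code regions use it) can be sketched in C++ with an OpenMP `target data` region. This is one common way to express reuse, shown under assumptions: the function name `scale_then_offset` is hypothetical, the guidance shown by Intel Advisor for your code may differ, and a compiler without OpenMP offload support ignores the pragmas and runs both loops on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical pipeline: two offloaded regions work on the same buffer.
void scale_then_offset(std::vector<float>& data, float scale, float offset) {
    // Guard the narrowing cast used for the loop bounds below.
    assert(data.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int n = static_cast<int>(data.size());
    float* p = data.data();

    // The enclosing target data region maps the buffer once:
    // host to device on entry, device to host on exit.
    #pragma omp target data map(tofrom: p[0:n])
    {
        // First region: uses the device copy that is already present.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            p[i] *= scale;
        }

        // Second region: reuses the same device copy, no extra transfer.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            p[i] += offset;
        }
    }
}
```

Without the enclosing `target data` region, each of the two kernels would need its own `map` clause and the buffer would be transferred for each kernel separately. That is the difference the with-reuse and without-reuse estimates compare.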
