User Guide


Examine Data Transfers for Modeled Regions

Accuracy Level


Enabled Analyses

Survey + Characterization (Trip Counts and FLOP with cache simulation and light data transfer simulation) + Performance Modeling with no assumed dependencies
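These analyses can also be run from the command line. Below is a minimal sketch; the flag names follow recent Intel Advisor releases, and the project directory `./advi_results` and application name `./myApp` are placeholders:

```shell
# Survey: collect basic timing data for the application.
advisor --collect=survey --project-dir=./advi_results -- ./myApp

# Characterization: trip counts and FLOP with cache simulation
# and light data transfer simulation.
advisor --collect=tripcounts --flop --enable-cache-simulation \
        --data-transfer=light --project-dir=./advi_results -- ./myApp

# Performance modeling with no assumed dependencies.
advisor --collect=projection --no-assume-dependencies \
        --project-dir=./advi_results
```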

Result Interpretation

After running the Offload Modeling perspective at this accuracy level, you get an extended Offload Modeling report, which provides detailed information about the memory and cache usage and offload taxes of your application and shows you, in addition to the basic data:
  • More accurate estimations of traffic and time for all cache and memory levels.
  • Measured data transfer and estimated data transfer between host and device memory.
  • Total data for the loop/function from different callees.
The Offload Modeling perspective assumes a loop is parallel if its dependency type is unknown. This means that there is no information about the loop from a compiler and the loop is not explicitly marked as parallel, for example, with a programming model (OpenMP*, Data Parallel C++, Intel® oneAPI Threading Building Blocks).
If you previously generated a report at a lower accuracy level, all offload recommendations, metrics, and speedup estimates are updated to be more precise, taking the new data into account.
This topic describes data as it is shown in the Offload Modeling report in the Intel Advisor GUI. You can also view the results using an HTML report, but the data arrangement and some metric names may vary.
Example of an Accelerated Regions report with data transfer and tax estimations (Offload Modeling perspective)
In the Accelerated Regions tab of the Offload Modeling report, review the metrics about memory usage and data transfers:
  • In the metrics table:
    • In the Estimated Bound-by column group, see the largest and total time taxes paid for offloading a code region to a target platform. Expand the group to see a full picture of all time taxes paid for offloading the region to the target platform.
    • In the Estimated Data Transfer column, review the amount of data read by and written to a target platform if the code is offloaded.
    • In the Memory Estimates column group, see how well your application uses the resources of all memory levels. Expand the group to see more detailed and accurate metrics for individual memory levels.
  • Select a code region from the table and review the details about the amount of data transferred between host and device memory in the Data Transfer Estimations pane:
    • See the total amount of data transferred in each direction and the corresponding offload taxes.
    • See hints about optimizing data transfers in the selected code region.
For details about metrics reported, see Accelerator Metrics.

Next Steps

  • Based on the collected data, rewrite your code to offload it to a target platform and measure the performance of GPU kernels with the GPU Roofline Insights perspective.
  • Consider running the
    Offload Modeling
    perspective with a higher level of accuracy to get more precise offload recommendations.
