Explore Performance Gain from GPU-to-GPU Modeling
- Review the result summary and a result file location printed to a command prompt or a terminal.
- Review the project result generated in the project directory using the Intel® Advisor graphical user interface (GUI).
- Review HTML reports generated in the <project-dir>/e<NNN>/report directory.
- Review a set of CSV reports with detailed metric tables generated in the <project-dir>/e<NNN>/pp<NNN>/data.0 directory.
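For example, the following is a small C++17 sketch (not part of Intel Advisor; the directory layout is the one listed above, and the actual file names vary by Advisor version) that lists the CSV metric tables found in a result directory:

```cpp
#include <filesystem>
#include <iostream>

// Hypothetical helper: list the CSV metric tables Intel Advisor generated
// for an experiment, given a path such as <project-dir>/e<NNN>/pp<NNN>/data.0.
int main(int argc, char** argv) {
    namespace fs = std::filesystem;
    const fs::path reportDir = argc > 1 ? argv[1] : ".";
    for (const auto& entry : fs::directory_iterator(reportDir)) {
        if (entry.path().extension() == ".csv") {
            std::cout << entry.path().filename().string() << '\n';
        }
    }
    return 0;
}
```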
- In the Program Metrics pane, compare the Time on Baseline GPU and Time on Target GPU values and examine the Speedup for Accelerated Code to understand whether the GPU kernels in your application perform better on the target GPU. Time on Baseline GPU includes only the execution time of the GPU kernels and ignores the CPU parts of your application. Time on Target GPU includes the estimated execution time for the GPU kernels on the target GPU and the offload taxes. In the pie chart, review the ratio of GPU execution time to offload taxes (kernel launch tax and data transfer tax) to see where the GPU kernels spend most of the time.
- In the Offloads Bounded by pane, examine what the GPU kernels are potentially bounded by on the target GPU. The parameters with the highest percentage indicate where the GPU kernels spend the most time. Review the detailed metrics for these parameters in other tabs to understand whether you need to optimize your application for them.
- In the Top offloaded pane, review the top five GPU kernels with the highest absolute offload gain (in seconds) estimated on the target GPU. The gain is calculated as (Time measured on the baseline GPU - Time estimated on the target GPU); this arithmetic is illustrated in the sketch after the bottleneck steps below. For each kernel in the pane, you can review the speedup, the time on the baseline and target GPUs, the main bounded-by parameters, and the estimated amount of data transferred. Intel Advisor models kernels one-to-one and does not filter out kernels with an estimated speedup of less than 1.
- In the Code Regions table, examine the detailed performance metrics for the GPU kernels. The Measured column group shows metrics measured on the baseline GPU. Other column groups show metrics estimated for the target GPU. You can expand column groups to see more metrics. You can also select a kernel in the table and examine the highlighted measured and estimated metrics for it in the Details tab of the right-side pane to identify what you need to focus on. For example, to find a potential bottleneck:
- Examine the Estimated Bounded by column group, focusing on the Bounded by and Throughput columns. In the Bounded by column, you can see the main bottleneck and secondary bottlenecks. The Throughput column breaks the bottleneck time down by compute or memory throughput, latencies, and offload taxes, shown as a chart. See Bounded By for bottleneck details.
- For details about the bounding factors, expand the column group and find the columns corresponding to these bounding factors, for example, L3 Cache BW, DRAM BW, or LLC BW.
- Scroll to the right, expand the Memory Estimations column group, and examine the columns corresponding to the bottleneck you identified. For example, the bandwidth utilization is calculated as the ratio of the average bandwidth of a memory level to its peak bandwidth (see the sketch after these steps). A high value means that the kernel actively uses this memory level and it is a potential bottleneck.
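The arithmetic behind these metrics is simple enough to check by hand. Below is a minimal C++ sketch, with made-up numbers rather than values from a real report, that reproduces the offload gain and speedup formula from the Top offloaded pane and the bandwidth utilization ratio described above:

```cpp
#include <cstdio>

int main() {
    // Made-up example values for one kernel; in practice they come from the
    // measured and estimated columns of the Code Regions table.
    const double timeOnBaselineGpu  = 0.80;  // seconds, measured on the baseline GPU
    const double timeOnTargetGpu    = 0.25;  // seconds, estimated for the target GPU
    const double averageL3Bandwidth = 620.0; // GB/s, average for the kernel
    const double peakL3Bandwidth    = 820.0; // GB/s, peak for the memory level

    // Offload gain (seconds) and speedup, as defined for the Top offloaded pane.
    const double gain    = timeOnBaselineGpu - timeOnTargetGpu;
    const double speedup = timeOnBaselineGpu / timeOnTargetGpu;

    // Bandwidth utilization: average bandwidth of a memory level relative to
    // its peak. Values close to 100% flag that level as a likely bottleneck.
    const double l3Utilization = averageL3Bandwidth / peakL3Bandwidth;

    std::printf("gain = %.2f s, speedup = %.2fx, L3 utilization = %.0f%%\n",
                gain, speedup, l3Utilization * 100.0);
    return 0;
}
```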
You can also review the following data to find bottlenecks:
- If you see high cache or memory bandwidth utilization (for example, in the L3 Cache, SLM, or LLC column groups), consider optimizing cache/memory traffic to improve performance.
- If you see high latency in the Estimated Bounded by column group, consider optimizing cache/memory latency by scheduling enough parallel work for this kernel to increase thread occupancy.
- If you see a high data transfer tax in the Estimated Data Transfer with Reuse column group, consider optimizing data transfer taxes or using unified shared memory (USM); see the USM sketch at the end of this section.
- If you see a high data transfer tax for a kernel, select the kernel in the Code Regions table and examine the details about the memory objects transferred between the host device and the target GPU for that kernel in the right-side Data Transfer Estimations pane. Intel Advisor uses this data to estimate the data transfer traffic and data transfers for each kernel. Review the following data:
- The histogram for the transferred data, which shows the amount of data transferred in each direction and the corresponding offload taxes.
- The memory object table, which lists all memory objects accessed by the kernel with details about each object, such as its size, transfer direction (only to the host, only to the target, or from the host to the target and back), and object type. If you see a lot of small objects, this may result in high latency for the kernel, and high latency might cause a high data transfer tax.
- Hints about optimizing data transfers in the selected code region.
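As an illustration of the USM suggestion above, the following is a minimal SYCL sketch (SYCL is assumed here as the offload language; adapt the idea to your own code) that replaces explicit per-object host-to-device copies with a single shared USM allocation:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q;                 // default device selection
    const size_t n = 1 << 20;

    // Shared USM: one allocation visible to both host and device. The runtime
    // migrates the data on demand instead of copying each buffer explicitly.
    float* data = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // Kernel reads and writes the shared allocation directly.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] *= 2.0f;
    }).wait();

    std::printf("data[0] = %f\n", data[0]);
    sycl::free(data, q);
    return 0;
}
```

With shared USM, data movement is left to the runtime, so many small explicit per-object transfers (and the associated transfer taxes) can be replaced by demand migration of a single allocation.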