Get Started with Intel® Advisor

ID 766450
Date 3/31/2023
Public

Identify High-impact Opportunities to Offload to GPU

With the Offload Modeling perspective of the Intel® Advisor, you can model performance of your application on a target graphics processing unit (GPU) device and identify code regions that are the most profitable to run on the target GPU.

Offload Modeling perspective has two workflows:

  • With CPU-to-GPU modeling, you can profile an application running on a CPU and model its performance on a target GPU device to determine if you should offload parts of your application to the GPU.
  • With GPU-to-GPU modeling, you can profile an application running on a GPU and model its performance on a different GPU device to estimate a potential speedup from running your application on the different target.

This page explains how to profile a vector-add sample with CPU-to-GPU modeling and estimate its speedup on a target GPU.

Follow the steps:

  1. Build your application.
  2. Model application performance running on a GPU.
  3. Examine application performance estimated on the GPU.

Prerequisites

  1. Download the vector-add code sample from oneAPI samples GitHub* repository.

    You can also use your own application to follow the instructions below.

  2. Install the Intel Advisor as a standalone or as part of Intel® oneAPI Base Toolkit. For installation instructions, see Install Intel Advisor in the user guide.
  3. Install the Intel® oneAPI DPC++/C++ Compiler as a standalone or as part of Intel® oneAPI Base Toolkit. For installation instructions, see Intel® oneAPI Toolkits Installation Guide.
  4. Set up environment variables for the Intel Advisor and Intel oneAPI DPC++/C++ Compiler. For example, run the setvars script in the installation directory.

    This document assumes you installed the tools to a default location. If you installed the tools to a different location, make sure to replace the default path in the commands below.

    IMPORTANT:
    Do not close the terminal or command prompt after setting the environment variables. Otherwise, the environment resets.

Build Your Application

On Linux* OS

From the terminal where you set the environment variables:

  1. Navigate to the vector-add sample directory.
  2. Compile the application using the following command:
    icpx -g -fiopenmp -fopenmp-targets=spir64 -fsycl src/vector-add-buffers
  3. Run the application with a vector size 100000000 to verify the build as follows:
    ./vector-add-buffers 100000000

    If the application is built successfully, you should see the output similar to the following:

    Vector size: 100000000
    [0]: 0 + 0 = 0
    [1]: 1 + 1 = 2
    [2]: 2 + 2 = 4
    ...
    [99999999]: 99999999 + 99999999 = 199999998
    Vector add successfully completed on device.

    The vector-add-buffers application uses SYCL and runs on a GPU by default. For the workflow in the topic, you should temporarily offload it to a CPU to analyze as described in the section below.

On Windows* OS

From the command prompt where you set the environment variables:

  1. Navigate to the vector-add sample directory.
  2. Compile the application using the following command:
    icx-cl -fsycl /O2 /EHsc /Zi -o vector-add-buffers.exe src/vector-add-buffers.cpp
  3. Run the application with a vector size 100000000 to verify the build as follows:
    vector-add-buffers.exe 100000000

    If the application is built successfully, you should see the output similar to the following:

    Vector size: 100000000
    [0]: 0 + 0 = 0
    [1]: 1 + 1 = 2
    [2]: 2 + 2 = 4
    ...
    [99999999]: 99999999 + 99999999 = 199999998
    Vector add successfully completed on device.

    The vector-add-buffers application uses SYCL and runs on a GPU by default. For the workflow in the topic, you should temporarily offload it to a CPU to analyze as described in the section below.

Model Application Performance Running on a GPU

Run Offload Modeling from Graphical User Interface (GUI)

  1. From the terminal or command prompt where you set the environment variables, launch the Intel Advisor GUI:
    advisor-gui
  2. Create a vector-add project for the vector-add-buffers application. Follow the instructions in the Before You Begin page. When you have the Project Properties dialog box open:
    1. Go to the Analysis Target tab > Survey Analysis Types > Survey Hotspots Analysis.
    2. Click Browse… near the Application field and navigate to the vector-add-buffers application. Click Open.
    3. To set the vector size, in the Application parameters field, type 100000000.
    4. To offload the application to a CPU temporarily, click Modify… near the User-defined environment variables field. The User-defined Environment Variables dialog box opens.
    5. Click the empty line in the Variable column and enter variable name SYCL_DEVICE_FILTER.
    6. Click the empty line in the Value column and enter variable value opencl:cpu.
    7. Click OK to save the changes.
    8. Go to the Analysis Target tab > Survey Analysis Types > Trip Counts and FLOP Analysis and make sure the Inherit Settings from Survey Hotspots Analysis Type checkbox is selected.
    9. Click OK to create the project.
  3. In the Perspective Selector window, choose the Offload Modeling perspective.
  4. In the Analysis Workflow pane, select the following parameters:
    1. Make sure the baseline device is set to CPU.
    2. Set accuracy to Medium.
    3. Select the Gen11 GT2 target device.
  5. Click button to run the perspective.

    During the perspective execution with medium accuracy, Intel Advisor:

    • Analyzes your application total execution time and execution time of its loops/functions using the Survey analysis
    • Counts how many iterations each cycle performs during application runtime using Characterization analysis
    • Estimates execution time for code regions that can be offloaded to a target GPU and total time to transfer data from a CPU to a target GPU using the Performance Modeling analysis

When the execution completes, the Offload Modeling result is opened automatically.

Run Offload Modeling from Command Line Interface (CLI)

On Linux OS

From the command prompt where you set the environment variables:

  1. Navigate to the vector-add sample directory.
  2. Temporarily offload the application to a CPU with the SYCL_DEVICE_FILTER environment variable as follows:
    export SYCL_DEVICE_FILTER=opencl:cpu
  3. Run the Offload Modeling perspective with the medium accuracy level using the command-line collection preset:
    advisor --collect=offload --config=gen11_icl --project-dir=./vector-add -- vector-add-buffers 100000000

    This command runs the Offload Modeling perspective analyses for the default medium accuracy level one by one. During the perspective execution, Intel Advisor:

    • Analyzes your application total execution time and execution time of its loops/functions using the Survey analysis
    • Counts how many iterations each cycle performs during application runtime using Characterization analysis
    • Estimates execution time for code regions that can be offloaded to a target GPU and total time to transfer data from a CPU to a target GPU using the Performance Modeling analysis

When the execution completes, the vector-add project is created automatically, which includes the Offload Modeling results. You can view them with a preferred method.

On Windows OS

From the command prompt where you set the environment variables:

  1. Navigate to the vector-add sample directory.
  2. Temporarily offload the application to a CPU with the SYCL_DEVICE_FILTER environment variable as follows:
    set SYCL_DEVICE_FILTER=opencl:cpu
  3. Run the Offload Modeling perspective with the medium accuracy level using the command-line collection preset:
    advisor --collect=offload --config=gen11_icl --project-dir=.\vector-add -- vector-add-buffers.exe 100000000

    This command runs the Offload Modeling perspective analyses for the default medium accuracy level one by one. During the perspective execution, Intel Advisor:

    • Analyzes your application total execution time and execution time of its loops/functions using the Survey analysis
    • Counts how many iterations each cycle performs during application runtime using Characterization analysis
    • Estimates execution time for code regions that can be offloaded to a target GPU and total time to transfer data from a CPU to a target GPU using the Performance Modeling analysis

When the execution completes, the vector-add project is created automatically, which includes the Offload Modeling results. You can view them with a preferred method.

Examine Application Performance Estimated on the GPU

If you collect data using GUI, Intel Advisor automatically opens the results when the collection completes.

If you collect data using CLI, open the results in GUI using the following command:

advisor-gui ./vector-add

If the result does not open automatically, click Show Result.

You can also view the results in an interactive HTML report. The report data and structure are similar to the GUI results. You can share this report or open it on a remote system even if it does not have Intel Advisor installed. See Work with Standalone HTML Reports in the Intel Advisor User Guide for details.

NOTE:
The results that you see when you open the report might be different from what is shown in the following sections due to a different baseline device or system characteristics. You can still review the sections to understand the result analysis workflow in general.

Explore Performance Estimations for the Whole Application

When you open the Offload Modeling result in GUI, Intel Advisor shows the Summary tab first. This window is a dashboard containing the main information about application execution measured on a baseline device and estimated on a target GPU, estimated speedup, and more.

In the Summary window, notice the following:

  1. As the Top Metrics pane shows, speedup estimated for offloaded parts of the vector-add-buffers code only is 23.655x, but the speedup calculated by the Amdahl's law, which is the speedup for the whole application, is only 1.293x. This means that although the offloaded code regions can run faster on the GPU, they do not have a big impact on the overall application speedup. It might be because only 23% of the code is recommended for offloading.
  2. Explore per-application metrics in the Program Metrics pane. The estimated time on the target GPU (accelerated time) is 16.23 seconds, which includes 16.11 seconds on a host device and 125.2 milliseconds on the target device.
  3. Investigate the factors that limit your offloaded code regions in the Offload Bounded By pane. This pane characterizes all potential bottlenecks for the whole application, such as compute, data transfer, and latencies, with percentage of code impacted.

    The vector-add-buffers application has 77% of non-offloaded code. The offloaded code is mostly bounded by host memory bandwidth (23%), which means that the application might be memory-bound after offloading as it underutilizes memory resources on the host device. If you optimize how the application uses memory after offloading or increase memory bandwidth, you might improve application performance.

    You can see the bottlenecks per code region in the Top Offloaded pane by hovering over a histogram in the Bounded By column.

  4. In the Hardware Parameters pane, notice the host memory bandwidth for the current modeled Gen11 GT2 device is 48 GB/s.

    You can move the Memory BW slider in the pane to increase the value and remodel application performance for a custom device with higher memory bandwidth from command line. You may get a better estimated performance.

    You can use this pane to adjust GPU parameters to model performance on a different device with custom parameters.

  5. Examine the loops/functions with the highest speedup achieved by offloading to a GPU in the Top Offloaded pane. The topmost loop [loop in main at vector-add-buffers.cpp:139] has the highest estimated speedup on the target. You can click a loop to switch to the Accelerated Regions tab and explore the loop performance on the target GPU in more detail.

Analyze Estimated Performance for Specific Loops

In the Accelerated Regions tab, you can analyze estimated performance for each loop and check what loops are recommended to be offloaded to the target GPU.

Intel Advisor reports 5 loops/functions as potentially profitable to be offloaded to the target GPU.

  1. In the bottom pane, switch to the Source tab to examine code region sources.
  2. Click each loop in the Code Regions pane and examine its source in the Source tab to understand code region purpose.
    • [loop in main at vector-add-buffers.cpp:139] is a loop that is supposed to run on a host device to be compared with the execution result of the main loop.
    • [loop in VectorAdd(…)<…> at vector-add-buffers.cpp:83] is the main loop that adds vectors. It is a SYCL kernel.
    • [loop in main at vector-add-buffers.cpp:143] is a loop that is supposed to run on a host device to compare the results of the [loop in VectorAdd(…)<…> at vector-add-buffers.cpp:83] and [loop in main at vector-add-buffers.cpp:139].
    • [loop in InitializeVector at vector-add-buffers.cpp:91] is a loop that initializes data.
    • [loop in [TBB worker at private_server.cpp:265] is a subordinate oneAPI Threading Building Blocks loop.

    Based on application source analysis, [loop in VectorAdd(…)<…> at vector-add-buffers.cpp:83] is the only offload candidate.

  3. Click the [loop in VectorAdd(…)<…> at vector-add-buffers.cpp:83] loop in the Code Regions pane to analyze its performance in more details and examine the following metrics:
    1. The original time measured on a baseline CPU device is 528.6 milliseconds (Measured column group), time estimated on the target GPU after offload is 36 milliseconds (Basic Estimates Metrics column group), which is 14.907% of the total estimated time on the target GPU. This means that the code region can run 45.840x times faster on the target as reported in the Speed-Up column in Basic Estimates Metrics.
    2. In the Summary window, you saw that the application is mostly memory-bound on the target GPU. As indicated in the Bounded By column (Basic Estimated Metrics), the selected code region is bounded by DRAM bandwidth.
    3. Scroll to the right to the Estimated Bounded By column group. As indicated in the Throughout column, the code region spends 32.9 milliseconds to read from and write to the DRAM memory.
    4. Scroll to the Estimated Data Transfer with Reuse to see the estimated data traffic modeled between host and target devices.
    5. Scroll to the right to the Memory Estimations column group to see the traffic between target device memory levels modeled with cache simulation analysis. The column group shows traffic for each memory level on the target, such as L3 cache, last-level cache (LLC), DRAM memory, GTI memory.
  4. Examine performance summary for the selected code region in the Details pane on the right. As the pane reports, total estimated time for the code region is 36 milliseconds, which includes:
    • 5.4 milliseconds for computing vector addition
    • 32.9 milliseconds for reading from and writing to DRAM
    • 5.6 milliseconds for reading from and writing to L3 cache
    • 16.8 milliseconds for writing to last level cache (LLC)

    This means the code region mostly spends time for working with memory.

  5. View actionable recommendations for the selected loop in the Recommendations tab. Intel Advisor recommends you offload the selected loop to the target as it has high estimated speedup. The recommendation includes example code snippets for offloading the loop using SYCL or OpenMP*, which you can expand to see more syntax.

    You can also view the selected loop in using the Top-Down tab to locate it in the application call tree and see how it is related to other loops and examine its source code using the Source tab

Continue to examine the performance of other code regions in the application to get a better understanding on its performance estimated on the target.

Next Steps

  1. In the Code Regions pane, notice that [loop in VectorAdd(…)<…> at vector-add-buffers.cpp:83] has Parallel: Explicitdependency type, but there are three other loops/functions that have Parallel: Assumed dependency type reported (Measured column group). This means Intel Advisor does not have information about dependencies in these code regions, but it assumes there are no dependencies and the loops/functions can be parallelized.

    In most cases, if a loop has dependencies, it is not recommended to offload it to the target GPU. You can analyze the loop-carried dependencies with the Dependencies analysis of the Intel Advisor. See Check How Assumed Dependencies Affect Modeling in the user guide for a recommended workflow.

  2. If Intel Advisor detects real loop-carried dependencies, it models them as sequential kernels. Usually, such kernels are not profitable to run on a GPU. You need to resolve the dependencies before you offload your code to the target.
  3. Get more detailed report about data transfers in the application by running the Characterization analysis with a different data transfer simulation mode: medium to understand how memory objects are transferred or full to check if data reuse can improve the application performance.

Based on the data reported about the vector-add-buffers sample application, you can run it on the target GPU and optimize its performance or you can run other analyses to learn more about application behavior on the target GPU. After this, you can use GPU Roofline Insights perspective to profile application actual performance on the target GPU and see how it uses hardware resources.