Estimate the C++ Application Speedup on a Target GPU
This recipe shows how to check whether offloading your C++ application to a target GPU device is profitable using Intel® Advisor.
Scenario
The Offload Modeling workflow includes the following two steps:
- Collect application characterization metrics on CPU: run the Survey analysis, the Trip Counts and FLOP analysis, and optionally, the Dependencies analysis.
- Based on the metrics collected, estimate application execution time on a graphics processing unit (GPU) using an analytical model.
Information about loop-carried dependencies is important for application performance modeling because only parallel loops can be offloaded to a GPU. Intel Advisor can get this information from an Intel Compiler, an application callstack tree, and/or the Dependencies analysis results. The Dependencies analysis is the most common way, but it adds high overhead to the performance modeling flow.
In this recipe, we first run Offload Modeling assuming that the loops do not contain dependencies and then verify this assumption by running the Dependencies analysis for the profitable loops only.
There are three ways to run Offload Modeling: from the Intel Advisor graphical user interface (GUI), from the Intel Advisor command line interface (CLI), or using Python* scripts delivered with the product. This recipe uses the CLI to run analyses and the GUI to view and investigate the results.
Ingredients
This section lists the hardware and software used to produce the specific result shown in this recipe:
- Performance analysis tools: Intel® Advisor 2021. Available for download at https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html as a standalone component and at https://software.intel.com/content/www/us/en/develop/tools/oneapi/base-toolkit/download.html as part of the Intel® oneAPI Base Toolkit.
- Application: Mandelbrot is an application that generates a fractal image by matrix initialization and performs pixel-independent computations. There are two implementations available for download:
  - A native C++ implementation, which you can analyze with Offload Modeling
  - A SYCL implementation, which you can run on a GPU to compare its performance with the Intel Advisor predictions. Select a device to run the application on by setting the SYCL_DEVICE_TYPE=<CPU|GPU|FPGA|HOST> environment variable.
- Compiler: Intel® C++ Compiler Classic 2021 and Intel® oneAPI DPC++/C++ Compiler 2021. Available for download as part of the Intel® oneAPI HPC Toolkit at https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit/download.html.
- Operating system: Microsoft Windows* 10 Enterprise
- CPU: Intel® Core™ i7-8665U processor
- GPU: Intel® UHD Graphics 620 (Gen9 GT2 architecture configuration)
Prerequisites
Set up environment variables for the tools:
<oneapi-install-dir>\setvars.bat
Compile the C++ Mandelbrot Sample
Consider the following when compiling the C++ version of the Mandelbrot sample:
- The benchmark code consists of the main source file mandelbrot.cpp, which performs computations, and two helper files, main.cpp and timer.cpp. Include all three source files in the target executable.
- Use the following recommended options when compiling the application:
  - /O2 to request a moderate optimization level and optimize code for maximum speed
  - /Zi to enable debug information required for collecting characterization metrics
See details about other recommended options in the Intel Advisor User Guide.
Run the following command to compile the C++ version of the Mandelbrot sample:
icx.exe /Qm64 /Zi /nologo /W3 /O2 /Ob1 /Oi /D NDEBUG /D _CONSOLE /D _UNICODE /D UNICODE /EHsc /MD /GS /Gy /Zc:forScope /Fe"mandelbrot_base.exe" /TP src\main.cpp src\mandelbrot.cpp src\timer.cpp
For details about Intel C++ Compiler Classic options, see the Intel® C++ Compiler Classic Developer Guide and Reference.
Run Offload Modeling without Dependencies Analysis
First, get rough performance estimations using a special operating mode of the performance model that ignores potential loop-carried dependencies. In the CLI, use the --no-assume-dependencies command line option to activate this mode.
To model the Mandelbrot application performance on the target GPU with the Gen9 GT2 configuration:
- Run the Survey analysis to get baseline performance data:
advisor --collect=survey --stackwalk-mode=online --static-instruction-mix --project-dir=.\advisor_results --search-dir sym=.\x64\Release --search-dir bin=.\x64\Release --search-dir src=. -- .\x64\Release\mandelbrot_base.exe
- Run the Trip Counts and FLOP analysis to get call count data and model the cache for the Gen9 GT2 configuration:
advisor --collect=tripcounts --flop --stacks --enable-cache-simulation --data-transfer=light --target-device=gen9_gt2 --project-dir=.\advisor_results --search-dir sym=.\x64\Release --search-dir bin=.\x64\Release --search-dir src=. -- .\x64\Release\mandelbrot_base.exe
- Model application performance on the GPU with the Gen9 GT2 configuration, ignoring assumed dependencies:
advisor --collect=projection --config=gen9_gt2 --no-assume-dependencies --project-dir=.\advisor_results
The --no-assume-dependencies option minimizes the estimated time by assuming that a loop is parallel and has no dependencies.
The collected results are stored in the advisor_results project, which you can open in the GUI.
View Estimated Performance Results
To view the results in the GUI:
- Run the following from the command prompt to open the Intel Advisor GUI:
advisor-gui
- Navigate to the advisor_results project directory where you stored the results and open the .advixeproj project file.
- If the Offload Modeling report does not open, click Show Result on the Welcome pane. The Summary results collected for the advisor_results project should open.
If you do not have the Intel Advisor GUI or need to check the results briefly before copying them to a machine with the Intel Advisor GUI, you can open the HTML report located at .\advisor_results\e000\pp000\data.0\report.html. See Identify Code Regions to Offload to GPU and Visualize GPU Usage for more information about the HTML report.
Explore Offload Modeling Summary
The Summary tab of the Offload Modeling report shows modeling results in several views:
- In the Top Metrics and Program Metrics panes, review per-program performance estimations and their comparison with the baseline application performance on the CPU.
- In the Offload Bounded By pane, review characterization metrics and factors that limit the performance of regions in relation to their execution time.
- In the Top Offloaded pane, review the top five regions recommended for offloading to the selected target device.
- In the Top Non-Offloaded pane, review the top five non-offloaded regions and the reasons why they are not recommended to be run on the target.

For the Mandelbrot application, consider the following data:
- The estimated execution time on the target GPU is 0.05 s.
- The loop at mandelbrot.cpp:56 is recommended to be offloaded to the GPU.
  - Its execution time is 81% of the total execution time of the whole application (see Fraction of Accelerated Code).
  - If this loop runs on the target GPU, it is executed 13.9 times faster than on the CPU (see Speed Up for Accelerated Code).
  - If this loop is offloaded, the whole application runs 4.1 times faster, according to Amdahl's Law (see Amdahl's Law Speed Up).
- The loop at stb_image_write.h:885 is not profitable for offloading because the overhead for offloading it to the GPU is high.
Explore Accelerated Regions Report
To open the full Offload Modeling report, do one of the following:
- Click the Accelerated Regions tab at the top of the report.
- Click a loop/function name hyperlink in the Top Offloaded or Top Non-Offloaded pane.
The Accelerated Regions report shows details about all offloaded and non-offloaded code regions. Review the data reported in the following panes:
- The CPU+GPU table shows the result of modeling the execution of each code region on the GPU: it reports performance metrics measured on the baseline CPU platform and metrics estimated for application performance modeled on the target GPU, such as the expected execution time and the bottleneck (for example, whether a code region is compute or memory bound). You can expand data columns and scroll the grid to see more projected metrics. For the Mandelbrot application, the loop at mandelbrot.cpp:56 is recommended for offloading to the GPU. It is compute bound, and its estimated execution time on the Gen9 GT2 GPU is 12.3 ms. This region transfers 2.1 MB of data, mostly from GPU to CPU (write data transfers), but this does not add overhead because the GPU is integrated.
- Click the mandelbrot.cpp:56 code region in the CPU+GPU table to see its source code in the Source view with several offload parameters.
- Switch to the Top-Down tab to locate the mandelbrot.cpp:56 region in the application call tree. Use this pane to review the loop metrics together with its callstack.
Run Offload Modeling with Dependencies Analysis
The Dependencies analysis detects loop-carried dependencies, which prevent a loop from being parallelized and offloaded to the GPU. At the same time, this analysis is slow: it adds a high runtime overhead that makes your target application run 5-100x slower. Run the Dependencies analysis if your code might not be effectively vectorized or parallelized.
- In the CPU+GPU table, expand the loop at mandelbrot.cpp:56 to see its child loops.
- Expand the Measured column group. The Dependency Type column reports Parallel: Assumed for the mandelbrot.cpp:56 loop and its child loops. This means that Intel Advisor marks these loops as parallel because you used the --no-assume-dependencies option for the performance modeling, but it does not have information about their actual dependency type. If you are sure that loops in your application are parallel, you can skip the Dependencies analysis. Such loops should have a Parallel: <reason> value in the Dependency Type column, where <reason> is Explicit, Proven, Programming Model, or Workload.
- To check if the loops have real dependencies, run the Dependencies analysis.
- To minimize the Dependencies analysis overhead, select the loops with the Parallel: Assumed value to check their dependency type, for example, using loop IDs. Run the following command to get the IDs of those loops:
advisor --report=survey --project-dir=.\advisor_results -- .\x64\Release\mandelbrot_base.exe
This command prints the Survey analysis results with loop IDs to the command prompt. The mandelbrot.cpp:57 and mandelbrot.cpp:56 loops have IDs 2 and 3.
- Run the Dependencies analysis with the --mark-up-list=2,3 option to analyze only the loops of interest:
advisor --collect=dependencies --mark-up-list=2,3 --loop-call-count-limit=16 --filter-reductions --project-dir=.\advisor_results -- .\x64\Release\mandelbrot_base.exe
- Rerun the performance modeling to get the refined performance estimation:
advisor --collect=projection --config=gen9_gt2 --project-dir=.\advisor_results
- Open the advisor_results project with the refined results in the GUI:
advisor-gui .\advisor_results
The results of the Dependencies analysis and Offload Modeling are based on the Survey and Trip Counts and FLOP data collected before.
In the Accelerated Regions report, the loop at mandelbrot.cpp:56 and its child loops have the Parallel: Workload value in the Dependency Type column. This means that Intel Advisor did not find loop-carried dependencies and these loops can be offloaded and executed on the GPU.
Rewrite the Code in SYCL
Now you can rewrite the code region at mandelbrot.cpp:56, which Intel Advisor recommends executing on the target GPU, using the SYCL programming model.
The SYCL code should include the following actions:
- Selecting a device
- Declaring a device queue
- Declaring a data buffer
- Submitting a job to the device queue
- Executing the calculation in parallel
The resulting code should look like the following snippet from the SYCL version of the Mandelbrot sample:
using namespace sycl;
// Create a queue on the default device. Set SYCL_DEVICE_TYPE environment
// variable to (CPU|GPU|FPGA|HOST) to change the device
queue q(default_selector{}, dpc_common::exception_handler);
// Declare data buffer
buffer data_buf(data(), range(rows, cols));
// Submit a command group to the queue
q.submit([&](handler &h) {
    // Get access to the buffer
    auto b = data_buf.get_access(h,write_only);
    // Iterate over image and write to data buffer
    h.parallel_for(range<2>(rows, cols), [=](auto index) {
        …
        b[index] = p.Point(c);
    });
});
Make sure your SYCL code (mandel.hpp in the SYCL sample) contains the same image parameter values as the C++ version:
constexpr int row_size = 2048;
constexpr int col_size = 1024;
See the SYCL page and the oneAPI GPU Optimization Guide for more information.
Compare Estimations and Real Performance on GPU
- Compile the Mandelbrot sample as follows:
dpcpp.exe /W3 /O2 /nologo /D _UNICODE /D UNICODE /Zi /WX- /EHsc /MD /I"$(ONEAPI_ROOT)\dev-utilities\latest\include" /Fe"mandelbrot_dpcpp.exe" src\main.cpp
- Run the compiled Mandelbrot application:
mandelbrot_dpcpp.exe
- Review the application output printed to the command prompt. It reports the application execution time:
Parallel time: 0.0121385s
The Mandelbrot calculation in the offloaded loop takes 12.1 ms on the GPU. This is close to the 12.3 ms execution time predicted by Intel Advisor.
Key Take-Aways
- The Offload Modeling feature of Intel Advisor can help you design and prepare your application for execution on a GPU before you have the hardware. It can model your application performance on a selected GPU, calculate potential speedup and execution time, identify offload opportunities, and locate potential bottlenecks on the target hardware.
- Loops with dependencies cannot be executed in parallel and offloaded to a target GPU. When modeling application performance with Intel Advisor, you can choose to ignore potential dependencies or to consider all potential dependencies. You can run the Dependencies analysis to get more accurate modeling results if your application might not be effectively vectorized or parallelized.