Overview
This document demonstrates how MonteCarloMultiGPU, a Monte Carlo option pricing sample with multi-GPU support written in CUDA*, can be migrated to the SYCL* heterogeneous programming language.
MonteCarloMultiGPU
The MonteCarloMultiGPU sample evaluates the fair call price for a given set of European options using the Monte Carlo approach. Monte Carlo simulation is one of the most important algorithms in quantitative finance. This sample uses a single CPU thread to control multiple GPUs.
The Monte Carlo method computes expected values by generating random scenarios and then averaging them, which makes it very efficient to parallelize. On a GPU we can parallelize over the paths: each path is assigned to a single thread, so thousands of paths are simulated in parallel, with massive savings in computation time.
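As a purely illustrative sketch (not the sample's implementation; the function name and parameters are made up), a serial Monte Carlo estimate of a European call price looks like the following, and it is exactly the per-path loop that a GPU distributes across threads:

#include <algorithm>
#include <cmath>
#include <random>

// Illustrative serial Monte Carlo pricing of a European call option.
// S = spot price, X = strike, r = risk-free rate, v = volatility, T = time to maturity.
float monteCarloCall(float S, float X, float r, float v, float T, int numPaths) {
  std::mt19937 rng(1234);
  std::normal_distribution<float> gauss(0.0f, 1.0f);
  double sum = 0.0;
  for (int p = 0; p < numPaths; ++p) {            // each path is independent ...
    float z = gauss(rng);                         // ... one random scenario per path
    float endPrice = S * std::exp((r - 0.5f * v * v) * T + v * std::sqrt(T) * z);
    sum += std::max(endPrice - X, 0.0f);          // call payoff for this scenario
  }
  return float(std::exp(-r * T) * (sum / numPaths));  // discounted average payoff
}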
CURAND Library and oneMKL RNG
The CURAND library provides facilities that focus on the simple and efficient generation of high-quality pseudorandom and quasirandom numbers. CURAND consists of two pieces: a library on the host (CPU) side and a device (GPU) header file.
Intel® oneAPI Math Kernel Library (oneMKL) provides SYCL interfaces for the Vector Statistics Random Number Generators (RNG) routines implementing commonly used pseudorandom, quasi-random, and non-deterministic generators with continuous and discrete distributions.
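For illustration, a minimal sketch of generating standard normal random numbers with the oneMKL RNG host API might look like the following (the engine choice, seed, and buffer size are assumptions made for this example):

#include <cstdint>
#include <sycl/sycl.hpp>
#include <oneapi/mkl/rng.hpp>

int main() {
  sycl::queue q;                                         // default device
  const std::int64_t n = 1000;                           // illustrative buffer size
  const std::uint64_t seed = 777;                        // illustrative seed
  float *r = sycl::malloc_device<float>(n, q);

  oneapi::mkl::rng::philox4x32x10 engine(q, seed);       // pseudorandom engine
  oneapi::mkl::rng::gaussian<float> distr(0.0f, 1.0f);   // mean 0.0, stddev 1.0
  oneapi::mkl::rng::generate(distr, engine, n, r).wait();// fill r on the device

  sycl::free(r, q);
  return 0;
}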
CUDA to SYCL Migration Approach
This document covers two approaches for CUDA to SYCL migration:
The first approach is using the Intel® DPC++ Compatibility Tool to automatically migrate CUDA source to SYCL source. The tool migrates 80 to 90 percent of the code and generates a warning for the rest, which must be manually migrated to SYCL. We look at Intel DPC++ Compatibility Tool generated warnings and learn how to migrate the code that was not migrated by the tool. This approach helps to accelerate the migration of CUDA source to SYCL and has proven especially helpful for large code bases.
The second approach is manual migration by analyzing CUDA source and replacing all CUDA-specific calls with equivalent SYCL calls. This approach helps a CUDA developer to understand SYCL programming. Once the migration is complete, we do performance analysis using Intel® VTune™ Profiler and Intel® Advisor Roofline, to understand the performance bottlenecks. Review the SYCL 2020 Specification for more details.
The following flow diagram shows the approach used for CUDA to SYCL migration:
Using Intel® DPC++ Compatibility Tool for migration
The Intel DPC++ Compatibility Tool and how to use it
The Intel® oneAPI Base Toolkit provides a compatibility tool that facilitates the migration of a program written in CUDA to the C++ with SYCL programming language.
The Intel DPC++ Compatibility Tool automatically migrates most of the code. For code that is not ported, dpct inserts inline comments (warning messages) that help you migrate and fine-tune the SYCL code. The warnings indicate how and where manual intervention is needed. Each warning has an assigned ID of the form "DPCT10XX" that can be consulted in the Developer Guide and Reference, which lists all the warnings, their descriptions, and suggestions on how to fix them. Once the migration is done, verify the correctness and efficiency of the resulting oneAPI program.
Migrate CUDA to SYCL using Intel DPC++ Compatibility Tool
The Intel DPC++ Compatibility Tool helps port CUDA-based code to SYCL and generates human-readable code, preserving the identifiers from the original source. The tool also detects and transforms standard CUDA index computations to SYCL.
The goal of this sample is to perform the migration process from CUDA to SYCL using the Intel DPC++ Compatibility Tool and demonstrate portability obtained by the migrated SYCL code in different GPU and CPU devices. The tool works by intercepting the build process and replacing CUDA code with the SYCL counterpart.
The DPC++ Compatibility Tool supports migration of CUDA kernels, host and device API calls (for example, memory management, events, math), and library calls (cuBLAS, cuSPARSE, cuSolver, cuRAND, etc.). Typically, the tool migrates 80%-90% of CUDA code to SYCL.
The Intel DPC++ Compatibility Tool migrated code for MonteCarloMultiGPU can be found at sycl_dpct_output.
To ensure you have the required CUDA versions and tools, see the Intel DPC++ Compatibility Tool system requirements.
Follow these steps to migrate the CUDA MonteCarloMultiGPU sample to SYCL:
- Make sure the system has Nvidia CUDA SDK installed (in the default path) and you have installed the Intel DPC++ Compatibility Tool from the Intel® oneAPI Base Toolkit.
- Set the environment variables. The setvars.sh script is in the root folder of your oneAPI installation, which is typically /opt/intel/oneapi/:
. /opt/intel/oneapi/setvars.sh
- Get the CUDA implementation of the MonteCarloMultiGPU sample from: MonteCarloMultiGPUCUDA Sample.
- Generate a compilation database with the tool intercept-build. This creates a JSON file with all the compiler invocations and stores the names of the input files and the compiler options.
intercept-build make
- Use the Intel DPC++ Compatibility Tool to migrate the code; it will store the result in the migration folder dpct_output.
dpct -p compile_commands.json
Intel DPC++ Compatibility Tool Options that Ease Migration and Debug
--keep-original-code: Keep the original CUDA code in comments of the generated SYCL file. This allows easy comparison of the original CUDA code with the generated SYCL code.
--comments: Insert comments explaining the generated code.
--always-use-async-handler: Always create cl::sycl::queue with the async exception handler.
- Verify the migration and address any Intel DPC++ Compatibility Tool warnings generated by consulting the Diagnostics Reference for detailed information about the warnings.
- Adapt the makefile to use the DPCPP compiler when appropriate and remove the CUDA-specific compilation flags.
For more information refer to Intel® DPC++ Compatibility Tool Best Practices.
Implement unmigrated SYCL code
After the Intel DPC++ Compatibility Tool migration, the unmigrated code can be identified by the warnings. Each warning has an assigned ID, and the flagged code can be resolved with a manual workaround by referring to the Developer Guide and Reference.
The Intel DPC++ Compatibility Tool complete migrated code for MonteCarloMultiGPU can be found at sycl_dpct_migrated.
Warnings generated with migration and manual workaround:
- DPCT1027: The call to curandSetPseudoRandomGeneratorSeed was replaced with 0 because the function call is redundant in DPC++.
checkCudaErrors(curandSetPseudoRandomGeneratorSeed(gen, seed));
API calls from the original application that do not have functionally compatible DPC++ counterparts are replaced with 0 if the Intel DPC++ Compatibility Tool determines that removing the call should not affect the program logic. In SYCL, functional correctness is preserved even after the call is removed.
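The seed itself is not lost: with oneMKL RNG the seed is supplied when the engine is constructed, so a separate seeding call becomes redundant. A hedged sketch using the oneMKL host API (q and seed are illustrative names):

// CUDA: the seed is set with a separate API call on an existing generator.
//   curandSetPseudoRandomGeneratorSeed(gen, seed);
// oneMKL: the seed is passed when constructing the engine, so no separate call is needed.
oneapi::mkl::rng::philox4x32x10 engine(q, seed);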
- DPCT1003: Migrated API does not return error code. (*, 0) is inserted. You may need to rewrite this code.
checkCudaErrors((sycl::free(plan->rngStates, dpct::get_default_queue()), 0));
CUDA code uses error codes for error checking; SYCL uses exceptions instead of error codes.
In SYCL, errors are reported as exceptions, and asynchronous errors are delivered to an exception handler attached to the queue:
auto exception_handler = [](exception_list exceptions) {
  for (std::exception_ptr const &e : exceptions) {
    try {
      std::rethrow_exception(e);
    } catch (exception const &e) {
      std::cout << "Caught asynchronous SYCL exception during ASUM:\n"
                << e.what() << std::endl;
    }
  }
};

for (int i = 0; i < nPlans; i++) {
  streams[i] = sycl::queue(gpu_devices[plan[i].device], exception_handler,
                           property::queue::in_order());
}
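Note that asynchronous exceptions captured this way are only delivered to the handler when the queue is synchronized with throwing semantics, for example:

streams[i].wait_and_throw();  // runs exception_handler for any pending asynchronous errors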
- DPCT1065: Consider replacing sycl::nd_item::barrier() with sycl::nd_item::barrier(sycl::access::fence_space::local_space) for better performance if there is no access to global memory.
cg::sync(cta);
Inside a kernel, the tool suggests replacing barrier() for better performance if there is no access to global memory. In this case the user should check the memory accesses and make the modification, for example as shown below.
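If the kernel only synchronizes accesses to work-group local memory, the suggested replacement is:

item_ct1.barrier(sycl::access::fence_space::local_space);  // local-memory-only barrier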
- DPCT1005: The SYCL device version is different from CUDA Compute Compatibility. You may need to rewrite this code.
In SYCL, the logic in the generated code uses a version extracted from cl::sycl::info::device::version, which is different from the CUDA Compute Capability.
auto device = q_ct1.get_device();
int Cores = device.get_info<cl::sycl::info::device::max_compute_units>();
Analyzing CUDA source
The CUDA implementation of the MonteCarloMultiGPU is available at this Github site.
The CUDA source for the MonteCarloMultiGPU implementation is distributed across the following files:
- MonteCarlo_gold.cpp – host code for
- Calculating call value using Black-Scholes formula
- Black-Scholes formula for MonteCarlo results validation
- Implementation of CPU MonteCarlo Method
- MonteCarloMultiGPU.cpp
- Utility function to tweak problem size for small GPUs
- CPU reference functions
- Allocate and initialize an array of stream handles
- Init each GPU
- Main Program
- MonteCarlo_kernel.cu
- Contains a kernel which computes the integral over all paths using a single thread block per option
- Host-side interface to GPU MonteCarlo
- Compute statistics and deallocate internal device memory
- multithreading.cpp
- Contains create thread function
- Wait for thread to finish
- Wait for multiple threads
- MonteCarlo_reduction.cuh
- Contains a function which calculates total sum for each of the two input arrays
- MonteCarlo_common.h
- Host-side data source and result destination
- Device and host side option data
- Random number generator states
- Intermediate device-side buffers
- multithreading.h
- Simple portable thread library
- realtype.h
- Defining float and double real type
The CUDA code consists of the eight files listed above. MonteCarlo_gold.cpp contains the CPU Monte Carlo method and the Black-Scholes formula used to validate the Monte Carlo results. MonteCarlo_kernel.cu contains the MonteCarloOneBlockPerOption kernel, which computes the integral over all paths using a single thread block per option. MonteCarloMultiGPU.cpp contains the solver thread that drives each host thread and a utility function to tweak the problem size for small GPUs.
This sample application demonstrates CUDA MonteCarloMultiGPU using key concepts such as random number generation, computational finance, and the CURAND library.
Migrating from CUDA to SYCL
In this section we migrate the CUDA code to SYCL by analyzing the CUDA code and identifying the relevant SYCL features. The underlying concepts of CUDA and SYCL are similar, but understanding the nomenclature of each language is essential to migrating CUDA code to SYCL. Review the SYCL 2020 Specification for more details.
The CUDA code in MonteCarloMultiGPU.cpp, MonteCarlo_gold.cpp, MonteCarlo_kernel.cu and MonteCarlo_reduction.cuh will be migrated to SYCL versions in MonteCarloMultiGPU.cpp, MonteCarlo_gold.cpp, MonteCarlo_kernel.cpp and MonteCarlo_reduction.hpp.
CUDA headers and SYCL headers
In the CUDA implementation, there are two levels for the runtime API. The C API (cuda_runtime_api.h) is a C-style interface that does not require compiling with nvcc. The cuda_runtime.h header is a C++-style interface built on top of the C API. It defines the public host functions, built-in type definitions for the CUDA runtime API, and function overlays for the CUDA language extensions and device intrinsic functions. It wraps some of the C API routines, using overloading, references, and default arguments.
#include <cuda_runtime.h>
In the SYCL implementation, for compatibility with SYCL 1.2.1, SYCL provides the standard header file <CL/sycl.hpp>, which can be included in place of <sycl/sycl.hpp>. In that case, all SYCL classes, constants, types, and functions defined by the specification exist within the ::cl::sycl C++ namespace. The sycl.hpp header file is provided by the Intel® oneAPI DPC++/C++ Compiler.
#include <CL/sycl.hpp>
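With compilers implementing SYCL 2020, the equivalent modern include is:

#include <sycl/sycl.hpp>  // SYCL 2020 header; classes live in the ::sycl namespace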
CUDA streams and SYCL queues
A CUDA stream is a sequence of operations that execute on the device in the order in which they are issued by the host code. The host places CUDA operations within a stream (for example, kernel launches, memory copies) and continues immediately. The device then schedules work from streams when resources are free. Operations within the same stream are ordered first-in, first-out (FIFO). Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently.
SYCL has queues that connect a host program to a single device. Programs submit tasks to a device via the queue and may monitor the queue for completion. In a similar fashion to CUDA streams, SYCL queues submit command groups for execution asynchronously. However, SYCL is a higher-level programming model, and data transfer operations are implicitly deduced from the dependencies of the kernels submitted to any queue.
In the CUDA implementation, the first step is to create a new asynchronous stream.
cudaStream_t *streams = (cudaStream_t *)malloc(nPlans * sizeof(cudaStream_t));
In SYCL we use queues in a similar fashion to CUDA streams; queues submit command groups for execution asynchronously. The SYCL runtime handles the execution order of the different command groups (kernel + dependencies) automatically across multiple queues in different devices.
More information can be found in SYCL queue.
CUDA Device and SYCL Device
In the CUDA implementation, the first step is initialization code that finds the best CUDA device; it is a generic way to identify a GPU. cudaSetDevice sets the current GPU, cudaStreamCreate creates an asynchronous stream, and cudaEventCreate creates an event object.
It is implemented in CUDA MonteCarloMultiGPU.cpp as follows:
for (int i = 0; i < nPlans; i++) {
checkCudaErrors(cudaSetDevice(plan[i].device));
checkCudaErrors(cudaStreamCreate(&(streams[i])));
checkCudaErrors(cudaEventCreate(&(events[i])));
}
In SYCL, cl::sycl::device::get_devices returns a vector of devices filtered by sycl::info::device_type. sycl::device is an abstract class representing various models of SYCL devices. A device can be a GPU, CPU, or another type of accelerator. Devices execute kernel functions.
SYCL queues connect a host program to a single device. Programs submit tasks to a device via the queue and may monitor the queue for completion.
This implementation can be found in MonteCarloMultiGPU.cpp as follows:
auto gpu_devices = cl::sycl::device::get_devices(cl::sycl::info::device_type::gpu);
for (int i = 0; i < nPlans; i++) {
streams[i] = sycl::queue(gpu_devices[plan[i].device],exception_handler,property::queue::in_order());
}
We can also set up a SYCL queue with the in_order queue property and an exception_handler. The in_order queue ensures that the kernel computation starts only after the memcpy operations are complete and that kernel executions do not overlap. In SYCL, errors are handled as exceptions that are delivered to the handler.
More information can be found in SYCL queue.
Memory Allocation on GPU Device — cudaMalloc and sycl::malloc_device
Memory must first be allocated on the GPU device so that data can be copied to GPU memory and be available for computation on the GPU. The cudaMalloc function can be called from the host or the device to allocate memory on the device, much like malloc for the host. Memory allocated with cudaMalloc must be freed with cudaFree.
In CUDA, memory allocation on GPU is done as follows using the cudaMalloc function:
checkCudaErrors(cudaMalloc(&plan->d_OptionData, sizeof(__TOptionData) * (plan->optionCount)));
checkCudaErrors(cudaMalloc(&plan->d_CallValue, sizeof(__TOptionValue) * (plan->optionCount)));
In SYCL, memory allocation on the accelerator device is accomplished using the sycl::malloc_device function as follows:
plan->d_OptionData = (void *)sycl::malloc_device(sizeof(__TOptionData) * (plan->optionCount), *stream);
plan->d_CallValue = (void *)sycl::malloc_device(sizeof(__TOptionValue) * (plan->optionCount), *stream);
The sycl::malloc_device returns a pointer to the newly allocated memory on the specified device on success. This memory is not accessible on the host. Memory allocated by the sycl::malloc_device must be deallocated with sycl::free to avoid memory leaks.
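When the plan is torn down, the corresponding deallocation might look like this (a sketch assuming the same queue, *stream, that was used for the allocation):

sycl::free(plan->d_OptionData, *stream);  // release device USM allocated with malloc_device
sycl::free(plan->d_CallValue, *stream);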
Memory copy from host to GPU memory
Once memory is allocated on the GPU, we must copy the memory from host to device, so that the data is available at the device for computation.
In CUDA, memory is copied from host to GPU using cudaMemcpyAsync, as follows. cudaMemcpyAsync copies data from host to device or from device to host and exhibits asynchronous behavior for most use cases. The cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost flags dictate the direction of the data transfer. cudaMemset initializes or sets device memory to a value.
checkCudaErrors(cudaMemcpyAsync(plan->d_OptionData, h_OptionData,
plan->optionCount * sizeof(__TOptionData), cudaMemcpyHostToDevice, stream));
checkCudaErrors(cudaMemcpyAsync(h_CallValue, plan->d_CallValue,
plan->optionCount * sizeof(__TOptionValue), cudaMemcpyDeviceToHost, stream));
checkCudaErrors(cudaMemset(plan->rngStates, 0, plan->gridSize * THREAD_N * sizeof(curandState)));
In SYCL, we use memcpy to copy memory between host and device. To initialize device memory, memset can be used to fill it with a value, as shown:
stream->memcpy(plan->d_OptionData, h_OptionData,
plan->optionCount * sizeof(__TOptionData));
stream->memcpy(h_CallValue, plan->d_CallValue,
plan->optionCount * sizeof(__TOptionValue));
stream->memset(plan->rngStates, 0,plan->gridSize * THREAD_N * sizeof(oneapi::mkl::rng::device::philox4x32x10<4>)).wait();
The first argument is the destination pointer; for device memory this must be a USM allocation. SYCL memcpy copies data from the source pointer to the destination; both source and destination may be either host or USM pointers.
Memory is copied asynchronously, so before the data can be used we must ensure that the copy is complete by calling wait(). wait() blocks the execution of the calling thread until all the command groups submitted to the queue have finished execution.
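For example:

stream->wait();  // block the host until the queued memcpy operations have completed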
More information about SYCL memcpy and asynchronous copy and synchronizing data can be found at SYCL queue and memcpy and wait.
This completes the host-side CUDA code migration to SYCL:
- The CUDA host code can be found in MonteCarloMultiGPU.cpp MonteCarlo_kernel.cu
- The SYCL code for host can be found in MonteCarloMultiGPU.cpp MonteCarlo_kernel.cpp
The following sections explain CUDA kernel code migration to SYCL.
Offloading computation to GPU
The CUDA kernel code is in MonteCarlo_kernel.cu. The computation happens in two kernels, MonteCarloOneBlockPerOption and rngSetupStates, which are offloaded to the device. MonteCarloOneBlockPerOption computes the integral over all paths using a single thread block per option. The curand_normal(&localState) function returns a single normally distributed float with mean 0.0 and standard deviation 1.0; this result can be scaled and shifted to produce normally distributed values with any mean and standard deviation. The curand_init() function sets up an initial state allocated by the caller using the given seed, subsequence, and offset. Different seeds are guaranteed to produce different starting states and different sequences.
In CUDA, a group of threads is named a thread block or simply a block. This is equivalent to the SYCL concept of work-group. Both block and work-group can access the same level of the hierarchy and expose similar synchronization operations.
In CUDA, a kernel is launched with an execution configuration in which the first parameter specifies the dimension and size of the grid (in blocks) and the second specifies the dimension and size of each block (in threads).
rngSetupStates<<<plan->gridSize, THREAD_N>>>(plan->rngStates, plan->device);
In SYCL, kernel constructs like single_task, parallel_for, and parallel_for_work_group each take a function object or a lambda function as one of their arguments. The code within the function object or lambda function is executed on the device.
cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, plan->gridSize) *
sycl::range<3>(1, 1, THREAD_N),
sycl::range<3>(1, 1, THREAD_N)),
[=](sycl::nd_item<3> item_ct1) {
rngSetupStates(plan_rngStates_ct0, plan_device_ct1, item_ct1);
});
});
After the queue setup, in our command group we submit a kernel using parallel_for. This function executes the kernel in parallel on several work-items. A sycl::nd_range specifies a 1-, 2-, or 3-dimensional grid of work-items that each execute the kernel function and that are grouped into work-groups. The sycl::nd_range consists of two 1-, 2-, or 3-dimensional ranges: the global work size (specifying the full range of work-items) and the local work size (specifying the range of each work-group).
The sycl::nd_item describes the location of a point in a sycl::nd_range. An nd_item is typically passed to a kernel function in a parallel_for. In addition to containing the ID of the work-item in the work-group and in the global space, the sycl::nd_item also contains the sycl::nd_range defining the index space.
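In the tool-generated 3-D ranges the CUDA x dimension maps to index 2, so the usual CUDA thread index expression corresponds roughly to the following (a sketch; tid is an illustrative name):

// CUDA:  int tid = blockIdx.x * blockDim.x + threadIdx.x;
int tid = item_ct1.get_group(2) * item_ct1.get_local_range(2) + item_ct1.get_local_id(2);
// or, equivalently:
int gid = static_cast<int>(item_ct1.get_global_id(2));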
CUDA thread block and SYCL work-group
In CUDA, Cooperative Groups provide device code APIs for defining, partitioning, and synchronizing groups of threads. We often need to define and synchronize groups of threads smaller than thread blocks to enable greater performance and design flexibility.
An instance of thread_block is a handle to the group of threads in a CUDA thread block that you initialize as follows:
cg::thread_block cta = cg::this_thread_block();
Every thread that executes that line has its own instance of the variable cta. Threads with the same value of the CUDA built-in variable blockIdx are part of the same thread block group.
In SYCL, a single execution of a given kernel is organized into work-groups and work-items. Each work-group contains the same number of work-items and is uniquely identified by a work-group ID. Additionally, within a work-group, a work-item can be identified by its local ID, and the combination of a local ID with a work-group ID is equivalent to the global ID.
auto cta = item_ct1.get_group();
In SYCL, nd_item::get_group() returns the group object representing the work-group; get_group(int dimension) returns the work-group's position within the overall sycl::nd_range in the given dimension.
CUDA thread block synchronization and SYCL barrier synchronization
Synchronization is used to synchronize the states of threads sharing the same resources.
In CUDA, synchronization is supported by all thread groups. We can synchronize a group by calling its collective sync() method or by calling the cooperative_groups::sync() function. These perform barrier synchronization among all threads in the group.
cg::sync(cta);
In SYCL, to synchronize the state of memory, we use the item::barrier(access::fence_space) operation. It makes sure that each work-item within the work-group reaches the barrier call. In other words, it guarantees that the work-group is synchronized at a certain point in the code.
item_ct1.barrier();
item::barrier emits a memory fence in the specified space, which can be access::fence_space::local_space, ::global_space, or ::global_and_local. A fence ensures that the state of the specified space is consistent across all work-items within the work-group.
CUDA cooperative group and SYCL subgroup
CUDA Cooperative Groups and SYCL sub-groups both aim to extend the programming model so that kernels can dynamically organize groups of threads, allowing threads to cooperate and share data to perform collective computations.
In CUDA, Cooperative Groups provides you with the flexibility to create new groups by partitioning existing groups. This enables cooperation and synchronization at finer granularity. The cg::tiled_partition() function partitions a thread block into multiple tiles.
cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cta);
Each thread that executes the partition will get a handle (in tile32) to one 32-thread group.
In SYCL, sub-groups allow partitioning of a work-group in a way that maps to low-level hardware and provides additional scheduling guarantees. The sub-group is an extension of the SYCL execution model that sits hierarchically between the work_group and the work_item. SYCL implementations often map sub-groups to low-level hardware features: for example, it is common for work-items in a sub-group to be executed in SIMD on hardware supporting vector instructions.
sycl::sub_group tile32 = item_ct1.get_sub_group();
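Because the sample's reduction accumulates per-path payoffs, a typical use of the sub-group is a collective partial sum. A hedged sketch (partialSum is an illustrative per-work-item value, not an identifier from the sample):

// Combine each work-item's partial payoff sum across the 32-wide sub-group.
float subgroupSum = sycl::reduce_over_group(tile32, partialSum, sycl::plus<float>());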
This completes the kernel-side CUDA code migration to SYCL; the migrated code (sycl_migration) can be found here.
The CUDA kernel code can be found at MonteCarlo_kernel.cu
The SYCL kernel code can be found at MonteCarlo_kernel.cpp
This concludes the migration of the CUDA code to SYCL. The source files can now be compiled for any GPU using the appropriate SYCL compiler, rather than being CUDA source that can only run on Nvidia GPUs.
With this SYCL source, we can compile to run MonteCarloMultiGPU on Intel® GPUs or CPUs using oneAPI DPC++ Compiler. Or we can compile to run on Nvidia GPUs/AMD GPUs using the open source LLVM compiler or hipSYCL compiler.
SYCL allows our source code to be portable across CPUs and GPUs from different vendors rather than being locked to vendor-specific hardware.
Tools for performance analysis
Intel® VTune™ Profiler
Intel® VTune™ Profiler is a performance analysis tool for serial and multithreaded applications. It helps analyze algorithm choices and identify where and how applications can benefit from available hardware resources. The data collector profiles your application using the OS timer, interrupts the process, collects samples of all active instruction addresses at a sampling interval of 10 ms, and captures a call sequence (stack) for each sample. By default, the collector does not gather system-wide performance data but focuses only on the application. Review Get Started with Intel® VTune™ Profiler for more details.
By default, VTune Profiler generates a summary report after collecting data. This report includes information on the following fields:
- Elapsed time
- GPU utilization information
- Information about the hottest computing tasks
- Recommendations
Intel® VTune Profiler provides a command line interface for remote analysis, scripted commands, and performance regression checks to monitor software performance over time. The vtune command line interface (CLI) provides an extensive set of options with which you can perform almost every task that is possible through the GUI.
To collect profiling data the following script can be run in the command line:
#!/bin/bash
source /opt/intel/oneapi/setvars.sh
#Vtune GPU Hotspot script
bin="MonteCarlo"
prj_dir="vtune_data"
echo $bin
rm -r ${prj_dir}
echo "Vtune Collect hotspots"
vtune -collect gpu-hotspots -result-dir ${prj_dir} $(pwd)/${bin}
echo "Vtune Summary Report"
vtune -report summary -result-dir ${prj_dir} -format html -report-output $(pwd)/vtune_${bin}.html
Make sure the above script "vtune_report.sh" file is in the same location as the application binary, make any necessary changes to the binary name in script if your binary name is different, run the script to collect VTune Profiling data and generate html report, the HTML report will look like this:
Figure 1: VTune™ Profiler metrics.
Figure 1 is the snapshot from VTune Profiler, which represents the total elapsed time of the MonteCarloMultiGPU SYCL migrated code.
Intel® Advisor Roofline
The Roofline analysis is a combination of the Survey analysis followed immediately by the Trip Counts/FLOPs analysis. The Trip Counts/FLOPs analysis may run three to four times longer than the Survey analysis. A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks. Roofline requires data from both the survey and trip counts with flops analysis types. You can choose to run these analyses separately or use a shortcut command that will run them one after the other. Review the Get Started with Intel® Advisor for more details.
To collect profiling data the following script can be run in the command line:
#!/bin/bash
source /opt/intel/oneapi/setvars.sh
#Advisor Roofline script
bin="MonteCarlo"
prj_dir="./roofline_data"
echo $bin
rm -r ${prj_dir}
advisor --collect=survey --project-dir=${prj_dir} --profile-gpu -- ./${bin} -q
advisor --collect=tripcounts --project-dir=${prj_dir} --flop --profile-gpu -- ./${bin} -q
advisor --report=roofline --gpu --project-dir=${prj_dir} --report-output=./roofline_gpu_${bin}.html -q
Make sure the above script "roofline_report.sh" file is in the same location as the application binary, make any necessary changes to the binary name in script if your binary name is different, run the script to collect Intel Advisor Roofline data and generate html report.
Figure 2: GPU Roofline chart.
The GPU Roofline chart shows performance of the application in terms of memory and compute. The x‑axis represents the arithmetic intensity and y-axis represents the performance.
- Arithmetic intensity (x axis) - measured in number of floating-point operations (FLOPs) and/or integer operations (INTOPs) per byte, based on the loop/function algorithm, transferred between CPU/VPU and memory
- Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS)
Source Code Links
CUDA source | Github link |
SYCL source—manual migration 1-1 mapping | |
SYCL source—manual migration with optimization applied | |
SYCL source—DPCT output with unmigrated code | |
SYCL source—DPCT output with implemented unmigrated code |