How to Deliver Uncompromised Performance for Diverse Workloads across Multiple Architectures
Nitya Hariharan, application engineer; Rama Kishan Malladi, performance modeling engineer; Amarpal S. Kapoor, technical consulting engineer; Kevin P. O’Leary, technical consulting engineer; Intel Corporation
Getting the maximum achievable performance out of today’s hardware is a fine balance between optimal use of underlying hardware features and using code that is portable, easily maintainable, and power-efficient. These factors don’t necessarily work in tandem. They require prioritizing based on user needs. It’s nontrivial to maintain separate code bases for different architectures. A standard, simplified programming model that can run seamlessly on scalar, vector, matrix, and spatial architectures gives developers greater productivity through increased code reuse and reduced training investment.
oneAPI is an industry initiative designed to deliver these benefits. It’s based on standards and open specifications and includes the Data Parallel C++ (DPC++) language—an extension of the International Organization for Standardization (ISO) C++ based on the SYCL* specification from Khronos Group*—as well as a set of domain libraries. The goal of oneAPI is for hardware vendors across the industry to develop their own compatible implementations targeting their CPUs and accelerators. That way, developers only need to code in a single language and a set of library APIs across multiple architectures and multiple vendor devices.
The Intel oneAPI developer tools implementation, targeting Intel® CPUs and accelerators, consists of the Intel® oneAPI Base Toolkit (Base Kit) along with multiple domain-specific toolkits that cater to different users in high-performance computing (HPC), AI, IoT, and rendering and ray tracing.
Figure 1 shows the different layers that are part of the Intel oneAPI product and the Base Kit, which consists of:
- The Intel® oneAPI DPC++/C++ Compiler
- The Intel® DPC++ Compatibility Tool
- Multiple optimized libraries
- Advanced analysis and debugging tools
Parallelism across architectures is expressed using the DPC++ language, which is based on SYCL from the Khronos Group. It uses modern C++ features along with Intel-specific extensions for efficient architecture use. DPC++ language features allow code to be run on the CPU and to be offloaded onto an available accelerator—making it possible to reuse code. A fallback property allows the code to be run on the CPU when an accelerator isn’t available. Running it on the host and accelerator, along with the memory dependencies, are clearly defined.
You can port code from CUDA* to DPC++ using the Intel DPC++ Compatibility Tool. It assists developers with a one-time migration and typically migrates 80 to 90 percent of the code automatically.
In addition to DPC++, the Intel® oneAPI HPC Toolkit (HPC Kit) supports OpenMP* 5.0 features that allow code to be offloaded onto a GPU. You can either transition to using DPC++ or the offload features on your existing C, C++, or Fortran code. API-based programming is supported through a set of libraries (for example, the Intel® oneAPI Math Kernel Library), which is optimized for Intel® GPUs.
The Intel oneAPI product also offers new features in Intel® VTune™ Profiler1 and Intel® Advisor2. They allow you to debug code and look at performance-related metrics when code is offloaded onto an accelerator.
Figure 1. Components of the Base Kit
This article introduces the oneAPI product to facilitate heterogeneous programming. We introduce the oneAPI software model and then discuss the compilation model and the binary generation procedure. oneAPI provides a single binary for all architectures, so the compile and link steps are different from normal methods of binary generation. Finally, we examine some sample programs.
Note We use the terms accelerator, target, and device interchangeably throughout this article.
oneAPI Software Model
This model, based on the SYCL specification, describes the interaction between the host and device in terms of running code and memory use. The model has four parts:
- A platform model specifying the host and device
- A model specifying the command queues and the commands that are run on the device
- A memory model specifying memory use between the host and device
- A kernel model that targets computational kernels to devices
The oneAPI platform model specifies the host and multiple devices that communicate with each other or the host. If there are multiple devices, the host controls the running kernels on the devices and coordinates among them. Each device can have multiple compute units. And each compute unit can have multiple processing elements. The oneAPI specification can support multiple devices like GPUs, FPGAs, and application-specific integrated circuit (ASIC) as long as the platform satisfies the minimum requirements of the oneAPI software model. This typically means the hosts need to have a specific operating system, a specific GNU Compiler Collection (GCC)* version, and certain drivers needed by the devices. (See the release notes for each oneAPI component for details on the platform requirements.)
This oneAPI model specifies how the code runs on the host and device. The host execution model creates command groups to coordinate running the kernels and data management between the host and devices. The command groups are submitted in queues that can be run with either an in-order or out-of-order policy. Commands within a queue can be synchronized to ensure data updates on the device are available to the host before the next command runs.
The device execution model specifies how the computation is done on the accelerator. Its running ranges over a set of elements that can either be a one-dimensional or multidimensional dataset. This range is split into a hierarchy of ND-Range, work groups, subgroups, and work items as shown in Figure 2 for a 3D case.
Figure 2. Relationship between ND-Range, work groups, subgroups, and work items
Note This is similar to the SYCL model. The work item is the smallest unit to run in the kernel. And the work groups determine how data is shared among these work items. These hierarchical layouts also determine the kind of memory to use for better performance. For example, work items typically operate on temporary data that's stored in the device memory and work groups use global memory. The subgroup classification, a new feature of SYCL 2020, supports hardware resources that have a vector unit. This allows parallel execution on elements.
From Figure 2, it’s clear that the location of the work group or work item within ND-Range is important, since this determines the data point being updated within the computational kernel. The index into ND-Range that each work item acts upon is determined using intrinsic functions in the nd_item class (global_id, work_group_id, sub_group_id, and local_id).
The oneAPI memory model defines how the host and device handle memory objects. It helps you decide where memory is allocated depending on the application needs. Memory objects are classified as type buffer or images. An accessor can be used to indicate the location of the memory object and the mode of access. The accessor provides different access targets for objects residing on the host, global and local memory on the device, or images residing on the host. The access types can be read, write, atomic, or read and write.
The unified shared memory model allows the host and device to share memory without the use of explicit accessors. Synchronization using events manages the dependencies between host and device. You can either explicitly specify an event to control when data updated by a host or a device is available for reuse, or implicitly depend on the runtime and device drivers to determine this.
Kernel Programming Model
The oneAPI kernel programming model specifies the code that runs on the host and device. Parallelism isn’t automatic. You need to specify it explicitly using language constructs.
The DPC++ language requires a compiler that can support C++11 and later features on the host side. The device code, however, requires a compiler that supports C++03 features and certain C++11 features like lambda expressions, variadic templates, rvalue references, and alias templates. It also requires std::string, std::vector, and std::function support. Restrictions on certain features for the device code include virtual functions and virtual inheritance, exception handling, runtime type information (RTTI), and object management employing new and delete operators.
You can decide to use different schemes to describe the separation between the host and device code. A lambda expression can keep the kernel code in line with the host code. A functor keeps the host code in the same source file, but in a separate function. For users who are porting OpenCL™ code or who require an explicit interface between the host and device code, the kernel class provides the necessary interface.
You can implement parallelism in three different ways:
- A single task that runs the whole kernel in a single work item
- The parallel_for construct that distributes the tasks among the processing elements
- The parallel_for_work_group construct distributes the tasks among the work groups and can synchronize work items within a work group through barriers
oneAPI Compilation Model
The oneAPI compilation model consists of build and link steps. However, the generated binary needs to support the execution of the device code on multiple accelerators. This means a DPC++ compiler and linker have to carry out additional commands to generate the binary. This complexity is generally hidden from the user but can be useful for generating target-specific binaries.
The host code compilation is done in the default way for a standard x86 architecture. The binary generation for the accelerator is more complex because it needs to support single or multiple accelerators in addition to optimizations that are specific to each accelerator. This accelerator binary, known as a fat binary, contains a combination of:
- An intermediate Standard Portable Intermediate Representation (SPIR-V), which is device-independent and generates a device-specific binary during compilation.
- Target-specific binaries that are generated at the time of compilation. Since oneAPI is meant to support multiple accelerators, multiple code forms are created.
Multiple tools generate these code representations, including the clang driver, the host and device DPC++ compiler, the standard Linux* (ld) or Windows* (link.exe) linker, and tools to generate the fat object file. During execution, the oneAPI runtime environment checks for a device-specific image within the fat binary and executes it, if available. Otherwise, the SPIR-V image is used to generate the target-specific image.
oneAPI Programming Examples
In this section, we look at sample code for the Intel oneAPI DPC++ Compiler, OpenMP device offload, and the Intel DPC++ Compatibility Tool.
Writing DPC++ Code
Writing DPC++ code requires you to exploit the APIs and syntax of the language. Listing 1 shows some sample code migration from C++ (CPU) code to a DPC++ (host and accelerator) code. It’s an implementation of the Högbom CLEAN algorithm posted on GitHub4. The algorithm iteratively finds the highest value in the image and subtracts a small gain of this point source convolved with the point spread function of the observation until the highest value is smaller than some threshold. The implementation has two functions: findPeak and subtractPSF. These must be ported from C++ to DPC++ as shown in Listings 1 and 2.
Figures 3 and 4 show the baseline and DPC++ implementation of the subtractPSF code.
Code changes required to port from C and C++ to DPC++ include:
- Introduction of the device queue for a given device (using the device selector API)
- Buffers created and accessed on the device (using the sycl::buffer/get_access APIs)
- Invocation of the parallel_for to spawn and run the computational kernel
- Wait for the completion of the kernel execution (and optionally catch any exceptions)
- Intel® DPC++ Compiler and flags: clang++ -fsycl -std=c++11 -O2 -lsycl -lOpenCL
The following images show the code changes for the findPeak function implementation. To better exploit parallelism in the hardware, DPC++ code has support for local_work_size, global_id/local_id, workgroup, and many other APIs, similar to the constructs used in the OpenCL platform and OpenMP*.
Baseline (Figure 5) and DPC++ (Figure 6) illustrate the implementation of the findPeak code. clPeak is a structure of value and position data. Concurrent execution of work groups is accomplished using the global and local IDs, and barrier synchronization across multiple threads (work items) in a work group. The result of this parallel_for execution is further reduced (not shown) to determine the maximum value and position across work groups.
OpenMP Offload Support
The HPC Kit provides OpenMP offload support that enables you to take advantage of OpenMP device offload features. We look at a sample open-source Jacobi code3 written in C++ with OpenMP pragmas. The code has a main iteration step that:
- Calculates the Jacobi update
- Calculates the difference between the old and new solution
- Updates the old solution
- Calculates the residual
The iteration code snippet is shown in Figure 7, a sample Jacobi solver with OpenMP pragmas.
Figure 8 shows the updated code with the omp target clause that's used to specify the data to transfer to the device environment with a data modifier that can be either:
Since array b is not modified, we use the clause to. And since x and xnew are initialized before the offload directives and updated within the device environment, we use the tofrom clause. The reduction variables d and r are also set and updated during each iteration and have the tofrom map clause.
Figure 8 shows a sample Jacobi solver updated with OpenMP offload pragmas.
To compile the offload target code with the oneAPI compiler, set the following:
- The environment variables pertaining to the compiler path
- The relevant libraries
- The different components
The path to these environment variables depends on the oneAPI setup on the user machine. We are looking at the compilation process, which is similar across machines, to demonstrate the ease of use of the specification. To compile the code, use the LLVM-based icx or icpc -qnextgen compiler as follows:
$ icpx -fiopenmp -fopenmp-targets=spir64 -D__STRICT_ANSI__ jacobi.cpp -o jacobi
The -D__STRICT__ANSI flag ensures compatibility with GCC 7.x and higher systems. The spir64 flag refers to the target-independent representation of the code and is ported to target-specific code during the link stage or while running. To run the code, run these commands:
$ export OMP_TARGET_OFFLOAD=”MANDATORY” $ export LIBOMPTARGET_DEBUG=1 $./jacobi
The MANDATORY option for OMP_TARGET_OFFLOAD indicates that the offload has to be run on the GPU. It’s set to DEFAULT by default, which indicates offload can be run on CPU and GPU. The LIBOMPTARGET_DEBUG flag, when set, provides offload runtime information that helps in debugging.
The OpenMP offload support example is for C and C++ programs, but Fortran offload is also supported. This allows HPC users with Fortran code bases to run their code on GPUs as well.
Intel® DPC++ Compatibility Tool
The Intel DPC++ Compatibility Tool is a command-line-based code migration tool available as part of the Base Kit. Its primary role is to enable the porting of existing CUDA source code to DPC++. Source lines where automatic migration isn’t possible are flagged through suitable errors and warnings. The Intel DPC++ Compatibility Tool also inserts comments in source locations where user interventions are necessary.
Figure 9 shows a typical workflow that CUDA users can use to port their source code to DPC++. The Intel DPC++ Compatibility Tool currently supports the Linux and Windows operating systems. This article assumes a Linux environment. The Intel DPC++ Compatibility Tool requires header files that are shipped with the CUDA SDK. To demonstrate the migration process, we use the VectorAdd sample from CUDA SDK 10.1, typically found in a location similar to:
$ ls /usr/local/cuda-10.1/samples/0_Simple/vectorAdd
Figure 9. Recommended workflow for migrating existing CUDA applications
VectorAdd is a single-source example with around 150 lines of code. The CUDA kernel-device code in this case computes the vector addition of arrays A and B into array C.
Note The commands, paths, and procedure shown here are correct at the time of publishing. Some changes may be introduced in the final version of the product.
To initialize the environment for using the Intel DPC++ Compatibility Tool, run the following command:
$ source /opt/intel/inteloneapi/setvars.sh
The setvars.sh script not only initializes the environment for the Intel DPC++ Compatibility Tool, but all other tools available in the Base Kit.
We use a simplified version of the CUDA makefile for compiling CUDA code, as shown in Figure 10.
The next step intercepts commands issued as the makefile runs and stores them in a compilation database file in JSON format. The Intel DPC++ Compatibility Tool provides a utility called intercept-build for this purpose. Here’s a sample invocation:
$ intercept-build make
Then invoke the migration step:
$ dpct -p compile_commands.json --in-root=. --out-root=dpct_output vectorAdd.cu
The –in-root and –out-root flags set the location of user program source and location where the migrated DPC++ code must be written. This step generates ./dpct_output/vectorAdd.dp.cpp.
To ensure that vector addition deploys onto the integrated GPU, it makes an explicit specification of the GPU queue instead of submitting it to the default queue. It gets the list of supported platforms with the list of devices for each platform by calling get_platforms() and platform.get_devices(). With the target device identified, it constructs a queue for the integrated GPU and dispatches the vector add kernel to this queue. Such a methodology may be used to target multiple independent kernels to different target devices connected to the same host or node.
Next, compile the modified DPC++ code using:
$ dpcpp -std=c++11 -I=/usr/local/cuda-10.1/samples/common/inc vectorAdd.dp.cpp -lOpenCL
This invokes the resulting binary. It confirms the vector addition is running on the integrated GPU, shown in Figure 11. You'll see output from running the ported DPC++ code on the integrated GPU.
For details on these tools, use these help flags:
$ intercept-build –h $ dpct –h
This article introduced oneAPI and the Intel® oneAPI Toolkits and outlined the components that are part of the Base Kit. Intel oneAPI products include toolkits to help users in the HPC, AI analytics, IoT, and video analytics domains. The DPC++ programming guide provides complete details on the various constructs supported for optimized accelerator performance. The OpenMP example shown in the article is for a C++ program. However, GPU offload is supported for C and Fortran as well. oneAPI provides the software ecosystem you need to port and run your code on multiple accelerators.
You May Also Like
oneAPI: The Path to Streamlined
oneAPI for All: A New Initiative for
Break Boundaries with DPC++
Developer-to-Developer: DPC++ Key
Introducing oneAPI: A Unified, Cross-Architecture Performance Programming Model
oneAPI: Uncompromised Performance for
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.