How to Move from OpenCL™ Application to Data Parallel C++
Robert Ioffe, senior exascale performance software engineer, Intel Corporation
Get the Latest on All Things CODE
Data Parallel C++ (DPC++) is a heterogeneous, portable programming language based on the Khronos SYCL* standard. This single-source programming language can target an array of platforms: CPUs, integrated and discrete GPUs, FPGAs, and other accelerators. To give you an idea of what DPC++ can do, we port a non-trivial OpenCL™ application, GPU-Quicksort, to DPC++ and document the experience. The goal is to exceed the capabilities of the initial application. OpenCL C makes it hard to write generic algorithms, and it becomes clear that it’s a serious shortcoming when you try to implement algorithms—like sorting—that need to work for different data types. The original GPU-Quicksort for OpenCL was written to sort unsigned integers. This article demonstrates how to use templates with DPC++ and implement GPU-Quicksort for multiple data types. Finally, we port GPU-Quicksort to Windows* and Ubuntu* 18.04 to show DPC++ portability.
What is GPU-Quicksort?
GPU-Quicksort is a high-performance sorting algorithm designed specifically for highly parallel, multicore graphics processors. It was invented in 2009 by Daniel Cederman and Phillippas Tsigas, a student and professor respectively from the Chalmers University of Technology in Sweden. GPU-Quicksort was originally implemented in CUDA*. In 2014, Robert Ioffe reimplemented it in OpenCL 1.2 and OpenCL 2.0 to demonstrate high performance on Intel® Processor Graphics and to showcase nested parallelism and work-group scan functions in OpenCL 2.0. It was fully implemented in OpenCL drivers. Learn how to port an OpenCL 1.2 implementation of the GPU-Quicksort to DPC++ and make the implementation generic so that it can sort not just unsigned integers, but also floats and doubles.
What is OpenCL™ Standard?
We’ll start with the OpenCL 1.2 implementation. Intel fully supports OpenCL standard, a Khronos standard for programming heterogeneous parallel systems, on various operating systems and platforms. OpenCL standard consists of:
- The runtime
- The host API
- The device C-based programming language OpenCL C
Here lie both its power and its limitations. The power is the ability to write high-performance, portable, heterogeneous parallel applications. Its main limitation is the necessity to write and debug two separate codes—the host side and the device side. There is also lack of templates and other C++ features that the modern programmers are accustomed to, which makes writing generic libraries in OpenCL standard difficult.
What’s Data Parallel C++?
DPC++ is an Intel implementation of Khronos SYCL* with extensions. The SYCL standard is designed to address the limitations of the OpenCL standard. DPC++ provides:
- A single-source programming model, which consists of a single code base for both host and device programming
- The full use of C++ templates and template metaprogramming on the device with minimal impact on performance without compromising portability
DPC++ lets a programmer target the CPUs, GPUs, and FPGAs while permitting accelerator-specific tuning—a definite improvement over OpenCL standard. It is also supported by Intel® Software Development Tools like Intel® VTune™ Profiler and Intel® Advisor, as well as by GDB*. This article makes full use of DPC++, especially its template features.
The Starting Point: Windows* Applications from 2014
We start with GPU-Quicksort for OpenCL 1.2 implementation (as described in the article GPU-Quicksort in OpenCL™ 2.0: Nested Parallelism and Work-Group Scan Functions). The original application was written for Windows, so we port it to Ubuntu 18.04 by adding the cross-platform code to measure time and use aligned_alloc/free for aligned memory allocation/deallocation, as opposed to _alligned_malloc/_aligned_free on Windows.
Let’s get a brief overview of GPU-Quicksort architecture. It consists of two kernels:
Written in OpenCL 1.2, these kernels are glued together by a dispatcher code, which iteratively calls gqsort_kernel until the input is split into small enough chunks, which can be fully sorted by lqsort_kernel. The application allows the user to select the following features:
- The number of times to run sort for measurement purposes
- The vendor and device on which to run the kernels
- The size of the input
- The permission to show the device details
The application follows a typical OpenCL architecture of supporting utilities to initialize OpenCL platforms and devices and building code for them. A separate file—with the OpenCL kernels and their supporting functions, and the main application that accepts user arguments—initializes the platform and device, builds the kernels, properly allocates memory, and creates buffers from it. The file, then binds them to the kernel arguments and launches the dispatcher function.
Data Parallel C++ and OpenCL Standard Interoperability: Platform Initialization
First, install the Intel® oneAPI Base Toolkit, which includes the Intel® oneAPI DPC++ Compiler. To port to DPC++, include CL/sycl.hpp header and use the namespace cl::sycl clause (to spare the verbosity of DPC++).
Now, instead of initializing a platform, a device, a context, and a queue in the OpenCL standard way, we do it in the concise DPC++ way.
We also need to retrieve the underlying OpenCL context, device, and queue, since the rest of the application is OpenCL based.
That’s our first iteration: configure and compile it with the Intel DPC++ Compiler and run it.
Data Parallel C++: How to Select an Intel® GPU
The shortcoming of the first iteration is that it always selects the default device, which may or may not be an Intel® GPU. To specify an Intel GPU, we need to write a custom device selector.
We use intel_gpu_selector to select an Intel GPU when the user asks for it.
Data Parallel C++: How to Set Kernel Arguments and Launch Kernels
The third iteration of our code uses DPC++ to set kernel arguments and launch kernels. The program is still built, and the kernels are obtained, in the OpenCL way. We use cl::sycl::kernel objects to wrap original OpenCL kernels.
For example, we replace several clSetKernelArg methods with set_arg DPC++ methods and clEnqueueNDRange calls with parallel_for calls. The example below shows gqsort_kernel, but a lqsort_kernel upgrade is similar.
Here’s a less verbose style to set all the arguments of the kernel with one set_args call.
We can also use a less verbose version of the parallel_for.
Data Parallel C++: How to Create Buffers and Set the Access Mode
We convert OpenCL buffers to DPC++ buffers. The first two are wrapping the memory that was alignallocated and passed into the function by reference. The other three are created from an STL vector. We use the template keyword in front of the get_access member function for buffers that we pass by reference. Note the different access modes for various buffers, depending on whether we need read or write access, or both. We do not directly pass buffers as kernel arguments; we pass the accessors to them.
Data Parallel C++: How to Query Platform and Device Properties
In OpenCL, we used the methods clGetPlatformInfo and clGetDeviceInfo to query various platform and device properties. Now, we can use get_info<> methods to query the same information. For example:
or query properties with a more complex structure.
Port OpenCL Kernels to Data Parallel C++, Part 1: gqsort_kernel
So far, we’ve initialized the platform and the device, created the buffers and their accessors and bound them to the kernels, and launched those kernels on the device in a DPC++ way. But, we still need to create the kernels in an OpenCL way. We use OpenCL C and clBuildProgram and clCreateKernel APIs to build the program and create kernels. The OpenCL C kernels are stored in a separate file that is loaded into the program at runtime before being built. We’ll change that, starting with the gqsort_kernel, the simpler of the two kernels.
The DPC++ way of creating kernels is via lambdas or functors. The use of lambdas for kernel creation is typically reserved for smaller kernels. When you have a more complex kernel that uses supporting functions, it’s a good idea to create a functor class. We’re going to create a gqsort_kernel_class functor and make it templated right from the start so that we can sort more than one datatype in the future.
A typical functor class has a void operator() that takes as a parameter an iteration id (in this case, a one-dimensional nd_item<1> id). The body of the kernel resides in the void operator(). The functor will also have a constructor that takes global and local accessors, the equivalent of global and local memory pointers, for an OpenCL kernel. The typical DPC++ functor will have a preamble, with using clauses defining various global and local accessor types. The gqsort_kernel looks as shown in the following example.
The private section of the functor contains all the global and local accessors used within the body of the void operator(). In this case, it will look as shown in the following example, with the first five accessors to global buffers and the rest to the local buffers.
gqsort_kernel is a complex kernel that uses supporting structs and two supporting functions: plus_prescan and median, which, in turn, use specialized OpenCL functions and extensively use local memory arrays and variables, local and global barriers, and atomics. All these elements must be translated into DPC++.
Let’s start with the functions. We omit structs, since they’re trivially templatized. The plus_prescan function that is used to calculate scan sums is relatively simple. So, to bring it to DPC++, the only change needed is to make it a template function and make this sort generic.
The next function is the median function. Make it a template function and replace the OpenCL C select function with the DPC++ cl::sycl::select function. Rename the DPC++ function as median_select to differentiate it from a similar host function.
In OpenCL C, it’s possible to both create local memory variables and arrays inside the body of the kernel and pass them as kernel parameters. But in DPC++, when using functors, we pass local buffer accessors when constructing the functor. In this case, all local memory variables and arrays will hold unsigned integers, so we will create a special local_read_write_accessor type.
We declare all the local memory variables.
We then pass them as parameters, along with global buffer accessors, to the functor constructor. The resulting object is passed to the parallel_for function.
Here, DPC++ lacks simplicity compared to OpenCL C. Next, the get_group_id and get_local_id functions are changed as sown in the following example.
Local barriers go from:
Global and local barriers go from:
For atomic operations, DPC++ is not as elegant as OpenCL C. So, what was concise:
Note the creation of cl::sycl::atomic<> variables prior to the use of DPC++ atomic operations, which cannot operate on the global or local memory pointers directly. So far, we have translated and templatized supporting structs and functions, converting specialized OpenCL C functions to DPC++. We’ve also created a template functor class with local accessors and translated barriers and atomics.
Porting OpenCL Kernels to Data Parallel C++, Part 2: lqsort_kernel
Translation of lqsort_kernel follows the familiar patterns outlined by the translation of gqsort_kernel: create a lqsort_kernel_class functor and then translate local memory arrays and variables and barriers (no atomics here). lqsort_kernel also uses supporting functions and structs. In addition to plus_prescan and median_select used by gqsort_kernel, we have bitonic_sort and sort_threshold functions, which are considerably more complex and specific to lqsort_kernel. After translation, they become the member functions of the lqsort_kernel_class. Their signatures change due to barrier use which, in the case of DPC++, requires the iteration objects. They work on local and global memory pointers, which require special handling so the OpenCL C signature:
These functions are translated similarly to gqsort_kernel, with the UINT_MAX macro being replaced with std::numeric_limits::max() to handle various data types in the future. When translating the lqsort_kernel, pointers to local memory (for example, local uint* sn;) are replaced with local_ptr<> objects (for example, local_ptr sn;). To retrieve the local pointer from the local accessor, we call the get_pointer member function of the accessor.
local_ptr<> and global_ptr<> objects work with pointer arithmetic, so what previously was d + d_offset, where d was a global pointer, becomes:
We translate local memory variables as accessors of size 1, meaning array accesses at index 0 (for example, gtsum). When we complete the lqsort_kernel translation, we fully transition to DPC++, but still sort unsigned integers. We did all the prework of templatizing supporting structs and functions and the functor classes of the two main kernels—and will enjoy the benefits.
The Power of Data Parallel C++: Templates and Their Caveats
The real power of DPC++ is the ability to use C++ templates, which enable writing generic code. We want the GPU-Quicksort to be generic and to be able to sort not only unsigned integers, but also other basic data types (for example, floats and doubles). In addition to the UINT_MAX to std::numeric_limits::max() change mentioned above, we need additional modification of the median_select function. cl::sycl::select takes a different type of the third argument, depending on the size of the type of the first two arguments, so we introduce the select_type_selector type traits class.
It allows us to convert a Boolean comparison to an appropriate type required by cl::sycl::select; median_select becomes:
To handle additional types, we need more specializations of select_type_selector. Now GPUQSort can sort floats and doubles on the GPU.
Back to Windows and RHEL*
To demonstrate DPC++ portability, we port the code to Windows and RHEL*. The RHEL port is minimal. We add the Intel imf math library at link time. Windows porting is slightly more complex. Add the following definitions when compiling.
Accounting for the fact that cl::sycl::select for doubles requires unsigned long long type as the third parameter (as opposed to unsigned long on Linux*), select_type_selector for doubles becomes:
On Windows, we undefine maximum and minimum to prevent the macro definitions from colliding with std::min and std::max. That’s all there is to it. We can sort unsigned integers, floats, and doubles using Intel GPUs on Windows and two Linux flavors.
We gradually translated GPU-Quicksort from its original OpenCL 1.2 into DPC++. At every step along the way, we had a working application. So, when you’re considering bringing DPC++ to your workflow, start small and either add on, or fully transition to DPC++ as time allows. Easily mix OpenCL and DPC++ in your code base and enjoy the benefits of both. Use legacy OpenCL kernels in their original form and enjoy the full power of C++ templates, classes, and lambdas when you’re developing new code in DPC++. Easily port code between Windows and various Linux flavors and choose which platform to develop on. You also have the full power of Intel tools to help you debug, profile, and analyze your DPC++ program.
- Khronos OpenCL™, the open standard for parallel programming of heterogeneous systems
- Khronos SYCL, C++ single-source heterogeneous programming for OpenCL™
- GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors by Daniel Cederman and Philippas Tsigas
- GPU-Quicksort in OpenCL™ 2.0: Nested Parallelism and Work Group Scan Functions by Robert Ioffe
- Intel® oneAPI Toolkits
- Accompanying Code for This Article