Offload C++ Standard Parallel Code to SYCL* Device Using oneDPL

11/14/2024

Nikita Shiledarbaxi

Software Tools Technical Marketing Engineer

Robert Mueller-Albrecht

Software Tools Technical Marketing Manager

Intel Corporation

The Parallel Standard Template Library (Parallel STL or pSTL) enables parallel and vectorized execution of C++ algorithms. By offloading Parallel STL algorithms to multiple devices (CPUs or GPUs) supporting the SYCL* programming framework, you can enhance application performance by harnessing the computing potential of heterogeneous architectures and the cross-platform parallelism capabilities of SYCL. The Intel® oneAPI DPC++^[1] Library (oneDPL) lets you offload Parallel STL code to SYCL devices, enabling multiarchitecture, accelerated parallel programming across heterogeneous hardware.

This article will discuss a code sample demonstrating how the oneDPL pSTL_offload preview feature helps offload C++ Parallel STL code to a SYCL device.

Intel® oneAPI DPC++ Library (oneDPL): An Overview

oneDPL, in combination with the Intel® oneAPI DPC++/C++ Compiler, helps expedite SYCL kernels for accelerated parallel programming on diverse architectures and hardware accelerators. Its Parallel API provides parallel extensions of C++ STL algorithms, execution policies, and range-based algorithms, so that C++ STL-style code can run efficiently in parallel on multi-core CPUs and be offloaded to GPUs. It supports developer-familiar parallel computing libraries such as Parallel STL and Boost.Compute*. Its SYCL-dedicated API helps accelerate SYCL kernels on GPUs, while the Device Selection API of oneDPL allows you to dynamically allocate available compute resources to your workload based on pre-defined device execution policies.

The library seamlessly integrates with the Intel® DPC++ Compatibility Tool and its open counterpart SYCLomatic tool for easy, automated CUDA* to SYCL code migration for multiarchitecture programming free from vendor lock-in.

About the Code Sample

The pSTL offload code sample illustrates how to offload C++ standard parallel algorithms to SYCL devices (CPUs and GPUs) with minimal code changes. It takes advantage of an experimental oneDPL feature that is enabled with the -fsycl-pstl-offload option of the Intel oneAPI DPC++/C++ Compiler.

The execution policies provided by the oneDPL Parallel API to run data parallel computations on heterogeneous devices include:

unseq for unsequenced (vectorized) execution

par for parallel execution

par_unseq, which combines the effects of the unseq and par policies

The code sample consists of three programs/sub-samples as follows:

FileWordCount counts the number of words in a file using C++17 parallel algorithms,

WordCount counts the number of generated words using C++17 parallel algorithms, and

ParSTLTests implements various STL algorithms with the different execution policies mentioned above (unseq, par and par_unseq).

The code sample illustrates how STL algorithms called with the std::execution::par_unseq policy can be automatically offloaded to a specified SYCL device by adding the -fsycl-pstl-offload compiler option and including the standard headers in the existing code.

The oneAPI programming model provides device selection environment variables that allow you to offload your SYCL or OpenMP* code to a specific compute resource or accelerator (such as a CPU, GPU, or FPGA). One such variable is ONEAPI_DEVICE_SELECTOR, which restricts the set of available compute resources that SYCL and OpenMP*-based applications can use to execute code. The variable also allows sub-devices to be chosen as individual execution devices. Learn more about various environment variables for C++ with SYCL runtimes.

The code sample shows how you can use the ONEAPI_DEVICE_SELECTOR variable to choose the target device to which the code is offloaded. The offloaded code is then implemented using oneDPL. Without the pSTL offload compiler option, the code gets offloaded to the default SYCL device.

→ For more information on what each of the sub-samples does, refer to the ‘Key Implementation Details’ section.

→ Step-by-step instructions to execute the code sample are available in the ‘Build and Run the pSTL offload Samples’ section.

NOTE: The sample demonstrates offloading STL code to Intel Data Center GPU Max and Intel Xeon CPU. However, you can follow the same procedure for offloading C++ STL code to any SYCL device.

What’s Next?

Check out the pSTL offload code sample and learn how to unlock the potential of accelerated hardware devices for efficient parallel programming in C++ with SYCL. Get started with oneDPL and explore oneDPL code samples today to expedite SYCL kernels on the latest CPUs, GPUs and other accelerators!

We also encourage you to explore other AI and HPC tools powered by the unified oneAPI programming paradigm for high-performance, accelerated, multiarchitecture parallel computing.