More Productive and Performant C++ Programming with oneDPL

Published: 02/07/2022  

Last Updated: 02/07/2022

By Pablo Reble

We have come a long way since 2005 when Herb Sutter declared that The Free Lunch is Over, referring to challenges that programmers face with emerging multicore processors. Today's portfolio of computing architectures and accelerators is even richer and constantly growing: a development driven by fundamental limitations of semiconductors and the desire for more powerful, energy-efficient computing. The computing world is becoming more heterogeneous, which creates challenges for programmers.

C++ is still among the five most popular programming languages (TIOBE ranks it as fourth as of January 2022). Attributes like full control over memory management and support for generic programming make it a great language to tackle heterogeneous programming challenges. Developer productivity and the cost of code maintenance are common concerns when choosing a programming language. Fortunately, previous studies show that you can expect a productivity boost by combining parallel building blocks with C++ algorithms. For example, optimized, built-in implementations of common functions and patterns (such as reduction) for specific architectures improve both performance and developer productivity.1, 2

Our industry-leading implementation of the oneAPI Data Parallel C++ Library (oneDPL)3 was contributed to the open-source LLVM project. As a result, developer effort can be significantly reduced in a multithreaded world.1, 4

Supercharged Classic STL Algorithms

Boost your code with something old and something new.

The C++ language itself is evolving, and so is its Standard Template Library (STL). For example, five years ago C++17 added execution policies to the algorithms library so that even existing C++ code can benefit from the parallelism that is common in modern processors. You can think of oneDPL as a supercharged C++ STL that allows different vendors to implement accelerated versions of classic algorithms in a portable way.
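To make the policy mechanism concrete, here is a minimal sketch in standard C++17 only (no oneDPL). The helper name `sum_of_squares` is hypothetical and chosen for illustration. The point is that the algorithm call stays the same and only the policy argument changes.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not part of any library): sums the squares of `values`
// using whichever standard execution policy the caller passes in.
template <class ExecutionPolicy>
long long sum_of_squares(ExecutionPolicy&& policy, const std::vector<int>& values)
{
    return std::transform_reduce(
        std::forward<ExecutionPolicy>(policy),   // e.g. std::execution::seq or std::execution::par
        values.cbegin(), values.cend(),
        0LL,                                     // initial value of the reduction
        std::plus<>{},                           // how partial results are combined
        [](int v) { return static_cast<long long>(v) * v; });  // per-element transform
}
```

Calling `sum_of_squares(std::execution::seq, data)` runs sequentially, while passing `std::execution::par` asks the library to parallelize the same reduction. Some toolchains need an extra threading backend for the parallel policies (GCC's standard library, for example, typically relies on oneTBB), so treat the parallel variant as dependent on your setup.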

oneDPL implements the C++ algorithms library using SYCL*:5

"SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that enables code for heterogeneous processors to be written using standard ISO C++ with the host and kernel code for an application contained in the same source file."5

For C++ programmers, there is only a modest learning curve for programming accelerators in SYCL. While C++ gives programmers full control over memory management, it has no concept of separate host and device memories. SYCL adds this, and oneDPL relies on the SYCL memory abstraction as a portable way to share data between host and devices. oneDPL algorithm functions are ready to use, familiar to C++ programmers, and optimized for various accelerators. This makes oneDPL easy for C++ programmers to learn and improves both code performance and developer productivity.

Here’s a simple example to illustrate the power of oneDPL:

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#include <iostream>
#include <vector>

int main()
{
   std::vector<int> data{1, 1, 1, 2, 1, 1, 1};

   // Default device policy: uses a queue bound to the default SYCL device
   auto policy = oneapi::dpl::execution::dpcpp_default;
   // Blocking call: returns an iterator to the maximum element
   auto maxloc = oneapi::dpl::max_element(policy, data.cbegin(), data.cend());

   // Report which device the policy's queue is bound to
   std::cout << "Run on "
             << policy.queue().get_device().template get_info<sycl::info::device::name>()
             << std::endl;
   std::cout << "Maximum value is at element "
             << oneapi::dpl::distance(data.cbegin(), maxloc) << std::endl;

   return 0;
}

This example offloads the common maxloc reduction (that is, finding the position of the element with the maximum value in the dataset) to the accelerator specified in the execution policy. The included headers are conformant with ISO C++, and so is the blocking behavior of max_element. Data movement is handled implicitly in this example. In other words, if the computation is offloaded to an accelerator, the runtime automatically handles host-device data transfer by wrapping the data in a SYCL buffer. Other modes exist that allow the programmer to explicitly control host-device data transfer.

In addition to parallel algorithm implementations in SYCL, oneDPL supports essential extensions for device programming such as custom iterators. To ensure interoperability across different platforms, such extensions were added to the oneDPL specification.6

What’s Next?

A Look into the Crystal Ball

Let’s focus on some powerful, experimental oneDPL features that are under development but have not been fully baked into ISO C++, and how to get access to them:

  • C++20 introduces Ranges, which can greatly improve expressiveness when using C++ STL algorithms. They extend the utility of algorithms by supporting more complex data access patterns with Views, all with fewer lines of code. ISO C++ Ranges algorithms do not support execution policies, which means they lack accelerator support. oneDPL enables Ranges for selected algorithms and provides extensions (such as custom SYCL views) to enable device programming.7
  • Classic C++ algorithms are well defined, including the blocking behavior of their function calls. However, blocking the host processor is not always desirable when offloading computation to an accelerator. To allow host work to overlap with device computation and data transfer, a set of asynchronous algorithms has been added to oneDPL. Their functionality is similar to that of the C++ algorithms but without the blocking behavior. Instead of returning the result directly, they return a C++ future-like object that the caller uses to wait for and retrieve the result.8
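The future-based pattern in the second bullet can be sketched with standard C++ alone. This is an analogy, not the oneDPL asynchronous API: it uses `std::async` and `std::future` in place of oneDPL's future-like object, and the helper names `sum_async` and `overlapped_total` are hypothetical.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: starts summing `values` on another thread and returns
// a future immediately instead of the result.
std::future<long long> sum_async(std::vector<int> values)
{
    return std::async(std::launch::async, [v = std::move(values)] {
        return std::accumulate(v.cbegin(), v.cend(), 0LL);
    });
}

// Hypothetical helper: overlaps host-side work with the pending computation
// and blocks only at the point where the result is actually needed.
long long overlapped_total(const std::vector<int>& offloaded, const std::vector<int>& local)
{
    std::future<long long> pending = sum_async(offloaded);  // does not block
    long long host_part =
        std::accumulate(local.cbegin(), local.cend(), 0LL); // host keeps working meanwhile
    return host_part + pending.get();                       // wait for the result here
}
```

The design point is the same one the asynchronous algorithms aim for: the call that launches the work returns a handle right away, and the caller decides when to pay the cost of waiting.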

There’s more to come. Other exciting features, such as automatic device selection, are planned for future releases, so stay tuned and follow us on GitHub*.

Final Thoughts: How to Learn More

oneDPL provides C++ building blocks that combine high performance with high productivity across CPUs, GPUs, FPGAs, and other accelerators. It is based on open standards, and its specification ensures interoperability across different platforms. The oneAPI implementation is a permissively licensed open-source project.5

References

  1. Parallel Research Kernels
  2. Analyze Reduction Abstraction Capabilities
  3. oneAPI DPC++ Library
  4. How to Boost Performance with Parallel STL and C++17 Parallel Algorithms
  5. SYCL Programming Language
  6. oneDPL Specification
  7. oneDPL Range-Based API Algorithms
  8. oneDPL Asynchronous API Algorithms
