Programming Guide


Data Parallelism in C++ using SYCL*

Open, multivendor, multiarchitecture support for productive data-parallel programming in C++ is provided by standard C++ with support for SYCL. SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that lets code for heterogeneous processors be written in standard ISO C++, with the host and kernel code for an application contained in the same source file. The DPC++ open source project is adding SYCL support to the LLVM C++ compiler.

Simple Sample Code

The best way to introduce SYCL is through an example. Since SYCL is based on modern C++, this example uses several features that have been added to C++ in recent years, such as lambda functions and uniform initialization. Even if developers are not familiar with these features, their semantics will become clear from the context of the example. Once you gain some experience with SYCL, these newer C++ features will become second nature.
The following application sets each element of an array to the value of its index, so that a[0] = 0, a[1] = 1, etc.
#include <CL/sycl.hpp>
#include <iostream>

constexpr int num=16;
using namespace sycl;

int main() {
  auto r = range{num};
  buffer<int> a{r};

  queue{}.submit([&](handler& h) {
    accessor out{a, h};
    h.parallel_for(r, [=](item<1> idx) {
      out[idx] = idx;
    });
  });

  host_accessor result{a};
  for (int i=0; i<num; ++i)
    std::cout << result[i] << "\n";
}
The first thing to notice is that there is just one source file: both the host code and the offloaded accelerator code are combined in a single source file. The second thing to notice is that the syntax is standard C++: there aren’t any new keywords or pragmas used to express the parallelism. Instead, the parallelism is expressed through C++ classes. For example, the buffer class on line 9 represents data that will be offloaded to the device, and the queue class on line 11 represents a connection from the host to the accelerator.
The logic of the example works as follows. Lines 8 and 9 create a buffer of 16 int elements, which have no initial value. This buffer acts like an array. Line 11 constructs a queue, which is a connection to an accelerator device. This simple example asks the SYCL runtime to choose a default accelerator device, but a more robust application would probably examine the topology of the system and choose a particular accelerator. Once the queue is created, the example calls the submit() member function to submit work to the accelerator. The parameter to this submit() function is a lambda function, which executes immediately on the host. The lambda function does two things. First, it creates an accessor on line 12, which can write elements in the buffer. Second, it calls the parallel_for() function on line 13 to execute code on the accelerator.
The call to parallel_for() takes two parameters. One parameter is a lambda function, and the other is the range object “r” that represents the number of elements in the buffer. SYCL arranges for this lambda to be called on the accelerator once for each index in that range, i.e. once for each element of the buffer. The lambda simply assigns a value to the buffer element by using the out accessor that was created on line 12. In this simple example, there are no dependencies between the invocations of the lambda, so the program is free to execute them in parallel in whatever way is most efficient for this accelerator.
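To make the execution model concrete, here is a host-only model in plain standard C++, with no SYCL runtime involved. It is only a sketch of the semantics described above: the kernel is called once per index, and because the invocations are independent, the order does not matter. The names for_each_index and fill_with_indices are hypothetical and are not SYCL functions; the helper deliberately walks the indices backwards to underline that the result cannot depend on ordering.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for parallel_for: calls `kernel` once for every
// index in [0, count). A real SYCL runtime may run these invocations
// concurrently and in any order; this model simply goes in reverse.
template <typename Kernel>
void for_each_index(std::size_t count, Kernel kernel) {
  for (std::size_t i = count; i-- > 0;) {
    kernel(i);
  }
}

// Mirrors the sample kernel: element i receives the value i.
std::vector<int> fill_with_indices(std::size_t count) {
  std::vector<int> data(count);
  int* out = data.data();  // captured by value, like the accessor in the sample
  for_each_index(count, [=](std::size_t idx) {
    out[idx] = static_cast<int>(idx);
  });
  return data;
}
```

Calling fill_with_indices(16) yields the same values the sample prints, regardless of the order in which the indices are visited.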
After calling parallel_for(), the host part of the code continues running without waiting for the work to complete on the accelerator. However, the next thing the host does is to create a host_accessor on line 18, which reads the elements of the buffer. The SYCL runtime knows this buffer is written by the accelerator, so the host_accessor constructor (line 18) is blocked until the work submitted by the parallel_for() is complete. Once the accelerator work completes, the host code continues past line 18, and it uses the result accessor to read values from the buffer.

Additional Resources

This introduction to SYCL is not meant to be a complete tutorial. Rather, it just gives you a flavor of the language. There are many more features to learn, including features that allow you to take advantage of common accelerator hardware such as local memory, barriers, and SIMD. There are also features that let you submit work to many accelerator devices at once, allowing a single application to run work in parallel on many devices simultaneously.
The following resources are useful for learning and mastering SYCL using a DPC++ compiler:
