Developer Guide


Pipe and Atomic Fence

This topic assumes that you already have an understanding of the atomic_fence function described in the SYCL specification. If you are new to it, read about the atomic_fence function in the Khronos* SYCL Specification before you proceed.
When running kernels in parallel, you might want multiple kernels to collaboratively access a shared memory. DPC++ provides the atomic_fence function as a synchronization construct for reasoning about the order of memory instructions that access the shared memory. The atomic_fence function controls the reordering of memory load and store operations (subject to the associated memory order and memory scope) when paired with synchronization through an atomic object. Pipe read and write operations behave as if they are SYCL-relaxed atomic load and store operations. When paired with atomic_fence functions to establish a synchronizes-with relationship, pipe operations can provide a guarantee on side-effect visibility in memory, as defined by the SYCL memory model. For additional information about the atomic_fence function, refer to the Khronos* SYCL Specification.
The current atomic_fence implementation for FPGA is still preliminary and is overly conservative:
  • The implementation guarantees only functional correctness, not maximum performance, because the atomic_fence function currently enforces more memory ordering than it requires. If you do not use the atomic_fence function with a correct memory_order parameter, you might see unexpected behavior in your program when the atomic_fence function handles memory ordering properly in a future release.
  • The implementation does not support the memory_scope::system constraint. The broadest scope supported for FPGA is memory_scope::device.
Example Code for Using the atomic_fence Function and Blocking Inter-Kernel Pipes
The following code sample shows how to use the atomic_fence function with a blocking inter-kernel pipe to synchronize the loads and stores to a shared device memory between a producer and a consumer:
#include <CL/sycl.hpp>

using namespace cl::sycl;

using my_pipe = ext::intel::pipe<class some_pipe, int>;

constexpr int READY = 1;

int produce_data(int data);
int consume_data(int data);

event Producer(queue &q, int *shared_ptr, size_t size) {
  return q.submit([&](handler &h) {
    h.single_task<class ProducerKernel>([=]() [[intel::kernel_args_restrict]] {
      // Create a device pointer to explicitly inform the compiler that the
      // pointer resides in the device's address space.
      device_ptr<int> shared_ptr_d(shared_ptr);

      // Produce data.
      for (size_t i = 0; i < size; i++) {
        shared_ptr_d[i] = produce_data(i);
      }

      // Use atomic_fence to ensure memory ordering.
      atomic_fence(memory_order::seq_cst, memory_scope::device);

      // Notify the consumer to start data processing.
      my_pipe::write(READY);
    });
  });
}

event Consumer(queue &q, int *shared_ptr, size_t size, int *output_ptr) {
  return q.submit([&](handler &h) {
    h.single_task<class ConsumerKernel>([=]() [[intel::kernel_args_restrict]] {
      // Create device pointers to explicitly inform the compiler that these
      // pointers reside in the device's address space.
      device_ptr<int> shared_ptr_d(shared_ptr);
      device_ptr<int> out_ptr_d(output_ptr);

      // Wait on the blocking pipe read until notified by the producer.
      int ready = my_pipe::read();

      // Use atomic_fence to ensure memory ordering.
      atomic_fence(memory_order::seq_cst, memory_scope::device);

      // Consume data and write it to the output memory address.
      for (size_t i = 0; i < size; i++) {
        out_ptr_d[i] = consume_data(shared_ptr_d[i]);
      }
    });
  });
}
In the above example, the consumer loads data produced by the producer. To prevent a scenario where the consumer loads from the shared device memory before the producer finishes storing to it, a blocking pipe is used to synchronize the two kernels. The consumer's pipe read does not return until it sees the READY value written by the producer. In this example, the atomic_fence functions in the producer and consumer prevent the shared-memory reads and writes from being reordered with the pipe instructions. They also form a release-acquire ordering, which ensures that by the time the consumer sees the pipe read return, the producer's write operations to the shared device memory are also visible to the consumer.
The shared device memory is created with a USM device allocation, which allows the two kernels to run in parallel even though they both access the shared device memory simultaneously.
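As a sketch of how the two kernels might be launched from the host, the shared buffer can be created with malloc_device and both kernels submitted back to back. The queue setup, buffer size, and wait pattern here are illustrative assumptions, not part of the original example, and the code assumes a device that supports USM device allocations:

```cpp
// Hypothetical host-side usage sketch; assumes the Producer and Consumer
// functions from the example above.
#include <CL/sycl.hpp>
using namespace cl::sycl;

int main() {
  queue q;
  constexpr size_t kSize = 1024;

  // USM device allocations for the shared buffer and the output.
  int *shared_ptr = malloc_device<int>(kSize, q);
  int *output_ptr = malloc_device<int>(kSize, q);

  // Both kernels are submitted immediately and can run concurrently;
  // the blocking pipe and atomic_fence order their memory accesses.
  event p = Producer(q, shared_ptr, kSize);
  event c = Consumer(q, shared_ptr, kSize, output_ptr);
  p.wait();
  c.wait();

  free(shared_ptr, q);
  free(output_ptr, q);
  return 0;
}
```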
