Intel® FPGA SDK for OpenCL™ Pro Edition: Programming Guide

ID 683846
Date 6/21/2022
Public


5.4.5.6. Use Models of Intel® FPGA SDK for OpenCL™ Channels Implementation

Concurrent kernel execution can improve the efficiency of channels on an FPGA.
To achieve concurrent execution, the host launches the kernels in parallel. The kernels can communicate with each other through channels where applicable.
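
For illustration, the following is a minimal host-side sketch of such a parallel launch for the producer and consumer kernels in the feed-forward example below. It assumes that the context, device, program, and buffers have already been created; the function name and the omitted error handling are illustrative only.

#include <CL/cl.h>

/* Launch the producer and consumer kernels on separate in-order command
 * queues so that they execute concurrently on the FPGA and can exchange
 * data through channels while both are running. */
static void launch_concurrently(cl_context context, cl_device_id device,
                                cl_program program, cl_mem src_buf,
                                cl_mem dst_buf, cl_uint iterations)
{
    /* One queue per kernel: kernels enqueued on the same in-order queue
     * would execute one after the other instead of concurrently. */
    cl_command_queue producer_queue = clCreateCommandQueue(context, device, 0, NULL);
    cl_command_queue consumer_queue = clCreateCommandQueue(context, device, 0, NULL);

    cl_kernel producer = clCreateKernel(program, "producer", NULL);
    cl_kernel consumer = clCreateKernel(program, "consumer", NULL);

    clSetKernelArg(producer, 0, sizeof(cl_mem),  &src_buf);
    clSetKernelArg(producer, 1, sizeof(cl_uint), &iterations);
    clSetKernelArg(consumer, 0, sizeof(cl_mem),  &dst_buf);
    clSetKernelArg(consumer, 1, sizeof(cl_uint), &iterations);

    /* Enqueue both kernels as single work-item tasks on their own queues
     * so that they run at the same time. */
    clEnqueueTask(producer_queue, producer, 0, NULL, NULL);
    clEnqueueTask(consumer_queue, consumer, 0, NULL, NULL);

    clFinish(producer_queue);
    clFinish(consumer_queue);

    clReleaseKernel(producer);
    clReleaseKernel(consumer);
    clReleaseCommandQueue(producer_queue);
    clReleaseCommandQueue(consumer_queue);
}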

The following use models provide an overview of how to exploit concurrent execution safely and efficiently.

Feed-Forward Design Model

Implement the feed-forward design model to send data from one kernel to the next without creating any cycles between them. Consider the following code example:


// Channels are declared at file scope, with the channels extension enabled
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel uint c0;
channel uint c1;

__kernel void producer (__global const uint * src,
                        const uint iterations)
{
    for (int i = 0; i < iterations; i++) 
    {
        write_channel_intel(c0, src[2*i]);
        
        write_channel_intel(c1, src[2*i+1]);
    }
}

__kernel void consumer (__global uint * dst,
                        const uint iterations)
{
    for (int i = 0; i < iterations; i++) 
    {
        dst[2*i] = read_channel_intel(c0);
        
        dst[2*i+1] = read_channel_intel(c1);
    }
}

The producer kernel writes data to channels c0 and c1. The consumer kernel reads data from c0 and c1. The figure below illustrates the feed-forward data flow between the two kernels:

Figure 8. Feed-Forward Data Flow


Buffer Management

In the feed-forward design model, data moves between the producer and consumer kernels one word at a time. To transfer large data messages consisting of several words, you can implement a ping-pong buffer, a design pattern commonly used for communication in applications. The figure below illustrates the interactions between the kernels and a ping-pong buffer:

Figure 9. Feed-Forward Design Model with Buffer Management


The manager kernel manages circular buffer allocation and deallocation between the producer and consumer kernels. After the consumer kernel processes data, the manager receives memory regions that the consumer frees up and sends them to the producer for reuse. The manager also sends to the producer kernel the initial set of free locations, or tokens, to which the producer can write data.
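
This design implies three channels: one that carries free-region tokens from manager to producer, one that carries ready-region tokens from producer to consumer, and one that carries released-region tokens from consumer back to manager. The following is a sketch of how those channels might be declared; the names req and c match the producer kernel shown later in this section, while the free_loc name and the depth of 2 (one slot per half of the ping-pong buffer) are illustrative assumptions.

#pragma OPENCL EXTENSION cl_intel_channels : enable

// manager  -> producer : offsets of regions that are free to write
channel int req      __attribute__((depth(2)));
// producer -> consumer : offsets of regions that are ready to read
channel int c        __attribute__((depth(2)));
// consumer -> manager  : offsets of regions that have been released
channel int free_loc __attribute__((depth(2)));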

The following figure illustrates the sequence of events that take place during buffer management:

Figure 10. Kernels Interaction during Buffer Management


  1. The manager kernel sends a set of tokens to the producer kernel to indicate which regions in memory are initially free for producer to use.
  2. After manager allocates a memory region, producer writes data to that region of the ping-pong buffer.
  3. After producer completes the write operation, it sends a synchronization token to the consumer kernel to indicate which memory region contains data for processing. The consumer kernel then reads data from that region of the ping-pong buffer.
    Note: While consumer performs the read operation, producer can write data to other free memory regions because the producer, consumer, and manager kernels execute concurrently.
  4. After consumer completes the read operation, it releases the memory region and sends a token back to the manager kernel. The manager kernel then recycles that region for producer to use.
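
Steps 1 and 4 can be implemented by a small manager kernel. The following is a minimal sketch that uses the req and free_loc channels declared above and assumes a two-region ping-pong buffer of 256-word blocks; the kernel name and the num_blocks argument (the number of 256-word blocks the producer writes, that is, iterations / 256 in the producer kernel shown in the next section) are illustrative.

__kernel void manager (const uint num_blocks)
{
    // Step 1: both 256-word regions of the ping-pong buffer start out free,
    // so send their offsets to the producer as the initial tokens.
    write_channel_intel(req, 0);
    write_channel_intel(req, 256);

    // Step 4: each region the consumer releases is recycled back to the
    // producer. Steps 2 and 3 take place in the producer and consumer kernels.
    for (uint i = 0; i < num_blocks; i++)
    {
        int region = read_channel_intel(free_loc);

        // The final recycled tokens are never consumed by the producer; they
        // simply remain buffered in the req channel when the kernels finish.
        write_channel_intel(req, region);
    }
}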

Implementation of Buffer Management for OpenCL Kernels

To ensure that the SDK implements buffer management properly, the ordering of channel read and write operations is important. Consider the following kernel example:

__kernel void producer (__global const uint * restrict src,
                        __global volatile uint * restrict shared_mem, 
                        const uint iterations)
{
    int base_offset;
    
    for (uint gID = 0; gID < iterations; gID++) 
    {
        // Assume each block of memory is 256 words
        uint lID = 0x0ff & gID;

        if (lID == 0)
        {
            base_offset = read_channel_intel(req);
        }

        shared_mem[base_offset + lID] = src[gID];

        // Make sure all memory operations are committed before
        // sending token to the consumer
        mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);

        if (lID == 255)
        {
            write_channel_intel(c, base_offset);
        }
    }
}

In this kernel, because the following lines of code are independent, the Intel® FPGA SDK for OpenCL™ Offline Compiler can schedule them to execute concurrently:

shared_mem[base_offset + lID] = src[gID];

and

write_channel_intel(c, base_offset);

Because the two operations are independent, writing base_offset to the channel might complete much sooner than writing the data to global memory. The consumer kernel might then read base_offset from the channel and use it as an index to read from global memory. Without synchronization, consumer might read from that region before shared_mem[base_offset + lID] = src[gID]; finishes executing, and therefore read invalid data. To avoid this scenario, the producer kernel must send the synchronization token only after it commits the data to memory. In other words, the consumer kernel cannot consume data from the producer kernel until producer stores its data in global memory successfully.

To preserve this ordering, include an OpenCL mem_fence call in your kernels. The mem_fence call in this example takes two flags: CLK_GLOBAL_MEM_FENCE and CLK_CHANNEL_MEM_FENCE. The mem_fence effectively creates a control flow dependence between the operations that occur before and after the call. The CLK_GLOBAL_MEM_FENCE flag indicates that global memory operations must obey this control flow, and the CLK_CHANNEL_MEM_FENCE flag indicates that channel operations must obey it. As a result, the write_channel_intel call in the example cannot start until the global memory operation is committed to the shared memory buffer.
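
For completeness, the following is a minimal sketch of a consumer kernel that matches this producer, using the c and free_loc channels declared earlier; the dst argument and the assumption that iterations is a multiple of 256 are illustrative. As in the producer, a mem_fence between the global memory reads and the token release keeps the read operations and the channel write ordered.

__kernel void consumer (__global uint * restrict dst,
                        __global volatile uint * restrict shared_mem,
                        const uint iterations)
{
    int base_offset;

    for (uint gID = 0; gID < iterations; gID++)
    {
        // Assume each block of memory is 256 words
        uint lID = 0x0ff & gID;

        if (lID == 0)
        {
            // Wait until the producer signals that this region is ready
            base_offset = read_channel_intel(c);
        }

        dst[gID] = shared_mem[base_offset + lID];

        // Make sure all reads from the region are committed before
        // releasing it back to the manager
        mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);

        if (lID == 255)
        {
            // Release the region so that the manager can recycle it
            write_channel_intel(free_loc, base_offset);
        }
    }
}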