Developer Guide


Double Buffering Host Utilizing Kernel Invocation Queue

Double buffering in a SYCL* host application allows the SYCL runtime environment to overlap memory transfers and host processing with kernel execution.
In an application where the FPGA kernel is executed multiple times, the host must perform the following processing and buffer transfers before each kernel invocation.
  1. The output data from the previous invocation must be transferred from the device to the host and then processed by the host. Examples of this processing include:
    • Copying the data to another location
    • Rearranging the data
    • Verifying it in some way
  2. The input data for the next invocation must be processed by the host and then transferred to the device. Examples of this processing include:
    • Copying the data from another location
    • Rearranging the data for kernel consumption
    • Generating the data in some way
Without double buffering, host processing and buffer transfers occur between kernel executions. Therefore, there is a gap in time between kernel executions, which you can refer to as kernel downtime (see Figure 1). If these operations overlap with kernel execution, the kernels can execute back-to-back with minimal downtime, thereby increasing overall application performance.
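For contrast, a single-buffered host loop might look like the following sketch. It is not part of the tutorial sample; it reuses the helper names (device_queue, simple_kernel, initialize_input, check_output, TIMES) from the code example later in this section, purely to show how every step is serialized:
// Single-buffered baseline (a sketch for contrast, not from the sample).
// Each iteration serializes input preparation, kernel execution, and
// output processing, so the kernel sits idle whenever the host works.
for (int i = 0; i < TIMES; i++) {
  initialize_input(input_buf);                        // Host prepares input
  simple_kernel(device_queue, input_buf, output_buf); // Kernel executes
  check_output(output_buf);                           // Host reads back and processes output
}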

Determine When Double Buffering Is Possible

Consider the following illustration:
Figure: Double Buffering Host
Following are the definitions of the required variables:
  • R: Time to transfer the kernel's output buffer from the device to the host
  • Op: Host-side processing time of kernel output data (output processing)
  • Ip: Host-side processing time of kernel input data (input processing)
  • W: Time to transfer the kernel's input buffer from the host to the device
  • K: Kernel execution time
In general, the R, Op, Ip, and W operations must all complete before the next kernel is launched. To maximize performance, while one kernel is executing on the device, these operations should execute simultaneously on the host and operate on a second set of buffer locations. They should complete before the current kernel completes, thus allowing the next kernel to be launched immediately with no downtime. In general, to maximize performance, the host must launch a new kernel every K.
This leads to the following constraint to minimize kernel downtime:
R + Op + Ip + W <= K
If the above constraint is not satisfied, you may still observe a performance improvement because some overlap, though perhaps not complete overlap, is still possible.
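For example, with hypothetical values R = 5 ms, Op = 10 ms, Ip = 10 ms, W = 5 ms, and K = 40 ms, the host-side total is R + Op + Ip + W = 30 ms <= K = 40 ms, so all host work for the next invocation fits within one kernel execution and the kernels can run back-to-back. If K were instead 20 ms, each invocation would incur roughly 30 ms - 20 ms = 10 ms of downtime, which is still far less than the full 30 ms of downtime incurred without double buffering.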

Measure the Impact of Double Buffering

You must get a sense of the kernel downtime to identify the degree to which this technique can help improve performance.
This can be done by querying the total kernel execution time from the runtime and comparing it to the overall application execution time. In an application where kernels execute with minimal downtime, these two numbers are close. However, if kernels have a lot of downtime, overall execution time notably exceeds kernel execution time.
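One way to collect these two numbers is through SYCL event profiling. The following is a minimal sketch, not part of the tutorial sample, assuming a SYCL 2020 compiler and a queue created with the enable_profiling property; it sums per-kernel execution times from event profiling info and compares the total against the wall-clock time of the loop:
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>

int main() {
  // Profiling info is only available on queues created with this property.
  sycl::queue q{sycl::property::queue::enable_profiling{}};

  auto wall_start = std::chrono::steady_clock::now();
  double kernel_ns = 0.0;

  for (int i = 0; i < 10; i++) {
    sycl::event e = q.submit([&](sycl::handler &h) {
      h.single_task([=] { /* kernel body */ });
    });
    e.wait();
    // Accumulate the device execution time of this kernel invocation.
    kernel_ns += static_cast<double>(
        e.get_profiling_info<sycl::info::event_profiling::command_end>() -
        e.get_profiling_info<sycl::info::event_profiling::command_start>());
  }

  double wall_ns = std::chrono::duration<double, std::nano>(
      std::chrono::steady_clock::now() - wall_start).count();

  // If wall-clock time notably exceeds total kernel time, the kernels
  // have downtime that double buffering may be able to hide.
  std::cout << "Total kernel time: " << kernel_ns / 1e6 << " ms\n"
            << "Wall-clock time:   " << wall_ns / 1e6 << " ms\n";
}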

Example: Using the Hardware Kernel Invocation Queue While Double Buffering

To utilize the hardware kernel invocation queue while double buffering, write your host code as shown in the following code snippet:
main() {
  ...
  initialize_input(input_buf[0]);
  initialize_input(input_buf[1]);
  simple_kernel(device_queue, input_buf[0], output_buf[0]);
  for (int i = 1; i < TIMES; i++) {
    simple_kernel(device_queue, input_buf[i%2], output_buf[i%2]); // Launch the next kernel

    // Process output from previous kernel.
    // This will block on kernel completion.
    check_output(output_buf[(i-1)%2]);

    // Generate input for the next kernel.
    initialize_input(input_buf[(i-1)%2]);
  }
  ...
}
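In this structure, simple_kernel only enqueues the kernel on device_queue and returns without waiting, so the host immediately moves on to processing and regenerating the other buffer pair while the newly launched kernel runs or waits in the hardware kernel invocation queue. Because check_output blocks on completion of the previous kernel (as the comment notes), the host stays one invocation ahead of the device.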
The following is an example definition of the initialize_input function:
void initialize_input(buffer<cl_float, 1> &inBuffer) {
  // A host accessor is required for host-side writes; constructing it
  // blocks until the buffer is available on the host.
  host_accessor buf_acc(inBuffer, write_only, no_init);
  for (int i = 0; i < N; i++) {
    buf_acc[i] = rand();
  }
}
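The sample does not reproduce check_output here. A minimal sketch, assuming the verification is a simple element-by-element check (the real logic is application specific, and <cassert> and <cmath> are assumed to be included), might be:
void check_output(buffer<cl_float, 1> &outBuffer) {
  // Constructing the host accessor blocks until the kernel that produced
  // outBuffer has completed and the data has been transferred to the host.
  host_accessor buf_acc(outBuffer, read_only);
  for (int i = 0; i < N; i++) {
    assert(!std::isnan(buf_acc[i])); // Placeholder check; application specific
  }
}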
For additional information, refer to the FPGA tutorial sample "Double Buffering" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample on GitHub.
