Host and Kernel Interaction
One of the main influences on the overall performance of an FPGA design is how kernels executing on the FPGA interact with the host on the CPU.
FPGA Device Communication with the Host
FPGA devices typically communicate with the host (CPU) via PCIe. This link is an important factor in the performance of SYCL* programs targeting FPGAs. Furthermore, the first time you run a particular SYCL program, the FPGA must be configured with its hardware bitstream, which may take several seconds.
Typically, the FPGA board has its own private Double Data Rate (DDR) memory, on which it primarily operates. The CPU must transfer all data that the kernel needs into the FPGA’s local DDR memory, usually in bulk through direct memory access (DMA). After the kernel completes its operations, the results must be transferred back to the CPU over DMA. The transfer speed is bounded by the PCIe link itself and by the efficiency of the DMA solution. For example, the Intel® PAC with Intel® Arria® 10 GX FPGA has a PCIe Gen 3 x8 link, and transfers are typically limited to 6-7 GB/s.
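To get a feel for what this bandwidth limit means in practice, the following minimal sketch estimates transfer times from a sustained bandwidth figure. The function names are hypothetical and not part of any SYCL or Intel API, and the model ignores per-transfer setup latency, so treat the results as a rough lower bound for large transfers only.

```cpp
#include <cassert>
#include <cstdint>

// Estimate the time, in seconds, to move `bytes` over a link that sustains
// `gigabytes_per_second` (1 GB = 1e9 bytes). Per-transfer setup latency is
// ignored, so this is only a rough lower bound for large transfers.
double estimate_transfer_seconds(std::uint64_t bytes,
                                 double gigabytes_per_second) {
    assert(gigabytes_per_second > 0.0);
    return static_cast<double>(bytes) / (gigabytes_per_second * 1e9);
}

// Round trip: inputs into the FPGA's DDR memory, then results back to the
// host. Assumes the two transfers run one after the other (no overlap).
double estimate_round_trip_seconds(std::uint64_t input_bytes,
                                   std::uint64_t output_bytes,
                                   double gigabytes_per_second) {
    return estimate_transfer_seconds(input_bytes, gigabytes_per_second) +
           estimate_transfer_seconds(output_bytes, gigabytes_per_second);
}
```

For example, under this model, moving 1.5 GB at a sustained 6 GB/s takes about 0.25 seconds in one direction. If the kernel itself runs for less time than that, the transfers dominate the end-to-end time.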
Use the following techniques to manage data transfer times:
SYCL allows buffer accesses to be tagged as read-only or write-only (through accessor access modes), which lets the runtime eliminate some unnecessary transfers.
Improve the overall system efficiency by maximizing the number of concurrent operations. Since PCIe supports simultaneous transfers in opposite directions and PCIe transfers do not interfere with kernel execution, you can apply techniques such as double buffering. Refer to the Double Buffering Host Utilizing Kernel Invocation Queue topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits and the double_buffering tutorial for additional information about these techniques.
Improve data transfer throughput by prepinning system memory on board variants that support Restricted USM. Refer to the Prepinning topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits for additional information.
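The benefit of overlapping transfers with kernel execution can be illustrated with a small timing model. The sketch below is not SYCL code and does not reproduce the double_buffering tutorial. It is a simplified simulation, in which `StageTimes`, `pipeline_finish_time`, and the stage durations are all hypothetical names and values chosen for illustration. It assumes one transfer at a time per PCIe direction, one kernel invocation at a time, and that a buffer set can be reused once its results are back on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-batch stage durations in arbitrary time units (hypothetical values).
struct StageTimes {
    double host_to_device;
    double kernel;
    double device_to_host;
};

// Returns the time at which the last batch's results arrive on the host.
// Model assumptions:
//   * one transfer at a time in each PCIe direction,
//   * one kernel invocation at a time,
//   * the three stages may overlap with one another,
//   * a buffer set becomes reusable once its results are back on the host.
// buffer_sets == 1 is the fully serialized case; 2 models double buffering.
double pipeline_finish_time(std::size_t batches, StageTimes t,
                            std::size_t buffer_sets) {
    assert(buffer_sets >= 1);
    std::vector<double> out_end(batches, 0.0);
    double in_end_prev = 0.0, kernel_end_prev = 0.0, out_end_prev = 0.0;
    for (std::size_t i = 0; i < batches; ++i) {
        // The input transfer needs both a free link and a free buffer set.
        const double buffer_free =
            (i >= buffer_sets) ? out_end[i - buffer_sets] : 0.0;
        const double in_end =
            std::max(in_end_prev, buffer_free) + t.host_to_device;
        // The kernel needs its input and the previous invocation to finish.
        const double kernel_end = std::max(in_end, kernel_end_prev) + t.kernel;
        // The output transfer needs the kernel result and a free link.
        out_end[i] = std::max(kernel_end, out_end_prev) + t.device_to_host;
        in_end_prev = in_end;
        kernel_end_prev = kernel_end;
        out_end_prev = out_end[i];
    }
    return batches == 0 ? 0.0 : out_end.back();
}
```

With four batches and stage times of 1, 2, and 1 units, this model gives 16 units with a single buffer set and 10 units with two. In the double-buffered case the kernel runs back to back after the first input transfer, so the total approaches the kernel time plus one input and one output transfer.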
You must program the hardware bitstream on the FPGA device in a process called configuration. Configuration is a lengthy operation requiring several seconds of communication with the FPGA device. The SYCL runtime manages configuration for you automatically and decides when it occurs. For example, configuration might be triggered when a kernel is first launched, but subsequent launches of the same kernel may not trigger it because the bitstream has not changed. Therefore, during development, Intel® recommends that you time the kernel's execution only after the FPGA has been configured, for example, by performing a warm-up execution of the kernel before the timed execution. You must remove this warm-up execution from the production code.
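The warm-up pattern can be sketched with a small timing helper in standard C++. `mean_seconds_after_warmup` and `run_kernel` are hypothetical names: `run_kernel` stands in for whatever submits the kernel and blocks until it finishes (for example, waiting on the event returned by the queue submission). The helper assumes that the first call absorbs any configuration the runtime performs.

```cpp
#include <cassert>
#include <chrono>

// Calls `run_kernel` once untimed, so that any FPGA configuration it
// triggers is excluded, then times `timed_runs` further calls and returns
// the mean wall-clock time per call in seconds.
// `run_kernel` is a hypothetical callable that must not return until the
// kernel has completed; otherwise only the submission cost is measured.
template <typename RunKernel>
double mean_seconds_after_warmup(RunKernel&& run_kernel, int timed_runs) {
    assert(timed_runs > 0);
    run_kernel();  // warm-up: may include configuration time
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) {
        run_kernel();  // measured runs: bitstream is expected to be loaded
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / timed_runs;
}
```

A helper like this belongs in benchmarking or development code only. In production, the extra warm-up call would simply be wasted work.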
Multiple Kernel Invocations
If a SYCL program submits the same kernel to a SYCL queue multiple times (for example, by calling single_task within a loop), only one kernel invocation is active at a time. Each subsequent invocation of the kernel waits for the previous run of the kernel to complete.
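The effect of this serialization on completion times can be illustrated with a simple model. This is a sketch in standard C++ rather than SYCL, `completion_times` is a hypothetical helper, and it assumes every invocation takes the same fixed run time.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Given the host-side submission time of each invocation of the same kernel
// (in submission order, arbitrary time units) and a fixed run time per
// invocation, returns when each invocation completes if only one invocation
// may be active at a time.
std::vector<double> completion_times(const std::vector<double>& submit_times,
                                     double run_time) {
    assert(run_time >= 0.0);
    std::vector<double> done;
    done.reserve(submit_times.size());
    double previous_done = 0.0;
    for (double submitted : submit_times) {
        // An invocation starts once it has been submitted and the previous
        // invocation of the kernel has completed.
        const double start = std::max(submitted, previous_done);
        previous_done = start + run_time;
        done.push_back(previous_done);
    }
    return done;
}
```

In this model, three invocations submitted together at time 0 with a run time of 2 units complete at 2, 4, and 6 units. Submitting the same kernel in a tight loop therefore queues work but does not make the invocations run concurrently.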