Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853
Date 3/31/2023
Public

Prepinning Memory

When optimizing kernel memory accesses, consider how data is transferred from the host to the device. For designs whose data transfer time exceeds the compute time, the transfer can become the bottleneck. On devices that support transfer rates greater than PCIe Gen3 x8, prepinning the host memory before the transfer allows the data to move at a higher bandwidth.
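
As a rough illustration of when the transfer becomes the bottleneck, the following C++ sketch compares an estimated transfer time against a kernel's compute time. The helper names transfer_seconds and transfer_bound are hypothetical and not part of any oneAPI API, and the bandwidth and compute-time values you pass in are assumptions to be replaced with measurements from your own system.

```cpp
#include <cassert>
#include <cstddef>

// Estimated time, in seconds, to move `bytes` over a link that sustains
// `gigabytes_per_second` (taking 1 GB as 1e9 bytes). Hypothetical helper.
double transfer_seconds(std::size_t bytes, double gigabytes_per_second) {
  assert(gigabytes_per_second > 0.0);
  return static_cast<double>(bytes) / (gigabytes_per_second * 1e9);
}

// True when the estimated host-to-device transfer takes longer than the
// kernel's compute time, that is, when the transfer is the likely bottleneck.
bool transfer_bound(std::size_t bytes, double gigabytes_per_second,
                    double compute_seconds) {
  return transfer_seconds(bytes, gigabytes_per_second) > compute_seconds;
}
```

Under this simple model, moving 1.2 GB at the approximately 12 GB/s half-duplex rate quoted in this section for the PAC D5005 takes about 0.1 s, so a kernel that computes for 0.05 s would be transfer bound, while one that computes for 0.2 s would not.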

For example, on the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA) that has a PCIe Gen3 x16 transfer rate, memory transfer with prepinning achieves approximately 12 GB/s in half-duplex and 21 GB/s in full-duplex. The following is an example of how to copy the prepinned memory to the device global memory when using such a board:

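// Assumes the SYCL and Intel FPGA extension headers are included
// and that the sycl namespace is in scope.
intel::fpga_selector device_selector;
auto device_queue = queue(device_selector);
// Allocate prepinned host memory through USM.
int* data = malloc_host<int>(1024, device_queue);
// … initialize the data
// Allocate device global memory, then copy the prepinned data into it.
int* data_device = malloc_device<int>(1024, device_queue);
// queue::copy takes (source, destination, count); wait for completion.
device_queue.copy<int>(data, data_device, 1024).wait();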

RESTRICTION:
  • Most BSPs implement the Unified Shared Memory (USM) call malloc_host() using prepinned memory. Hence, prepinned memory is available only on devices that support USM host allocations.
  • SYCL USM host allocations are only supported by some BSPs, such as the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA). Check with your BSP vendor to see if they support SYCL USM host allocations.

Pinned memory is a scarce resource on the system, so carefully consider which buffers you want to pin to avoid exceeding the system limit. In addition, pinning itself is an expensive operation, so for optimal performance, ensure that the creation of pinned buffers takes place outside the main compute loop.
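
The allocate-once pattern described above can be sketched as follows. This is a minimal illustration, not the guide's own implementation: process_batches and its four callables (alloc, release, fill, submit) are hypothetical names, and the allocator is passed in as a callable so that the sketch compiles without a SYCL toolchain.

```cpp
#include <cassert>
#include <cstddef>

// Allocates one staging buffer before the loop, reuses it for every batch,
// and releases it afterwards. All four callables are hypothetical hooks
// supplied by the caller.
template <typename Alloc, typename Release, typename Fill, typename Submit>
void process_batches(std::size_t num_batches, std::size_t count,
                     Alloc alloc, Release release, Fill fill, Submit submit) {
  assert(count > 0);
  int* staging = alloc(count);     // the expensive pinning happens once, here
  if (staging == nullptr) return;  // pinned memory is limited; allocation may fail
  for (std::size_t batch = 0; batch < num_batches; ++batch) {
    fill(staging, count, batch);   // host writes the next batch into the same buffer
    submit(staging, count);        // must finish with the buffer before the next fill
  }
  release(staging);                // unpin once, after the loop
}
```

With a SYCL queue q, alloc could wrap sycl::malloc_host<int>(count, q) and release could wrap sycl::free(ptr, q), while submit would copy the buffer to the device and wait for the copy to complete. Because the buffer is reused, the number of pinned allocations stays at one regardless of how many batches the loop processes.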