Intel® FPGA SDK for OpenCL™: Intel® Cyclone® V SoC Development Kit Reference Platform Porting Guide

ID 683435
Date 11/06/2017
Public

1.3. Software Support for Shared Memory

Shared physical memory between FPGA and CPU is the preferred memory for OpenCL™ kernels running on SoC FPGAs. Because the FPGA accesses shared physical memory, as opposed to shared virtual memory, it does not have access to the CPU's page tables that map user virtual addresses to physical page addresses.
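
The role of the page tables can be illustrated with a small user-space model. The sketch below is not SDK or driver code: the 4 KB page size, the four-entry table, and every toy_ name are assumptions made for illustration only. It shows that a buffer whose virtual pages are consecutive can be backed by scattered physical frames, which a device without page-table access cannot follow.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for this illustration */
#define TOY_PAGES     4u     /* number of entries in the toy table */

/* Hypothetical page table: index is the virtual page number,
 * value is the physical frame number that backs it. */
static const uint32_t toy_page_table[TOY_PAGES] = { 0x1A3, 0x07F, 0x2C0, 0x2C1 };

/* Translate a virtual address the way the CPU's MMU would.
 * Returns UINT64_MAX if the address lies outside the toy table. */
static uint64_t toy_virt_to_phys(uint64_t vaddr)
{
    uint64_t vpn = vaddr / TOY_PAGE_SIZE;
    uint64_t offset = vaddr % TOY_PAGE_SIZE;

    if (vpn >= TOY_PAGES)
        return UINT64_MAX;
    return (uint64_t)toy_page_table[vpn] * TOY_PAGE_SIZE + offset;
}

/* Return 1 if virtual pages [first, first + count) map to consecutive
 * physical frames, and 0 otherwise. */
static int toy_is_physically_contiguous(size_t first, size_t count)
{
    if (first + count > TOY_PAGES)
        return 0;
    for (size_t i = 1; i < count; i++) {
        if (toy_page_table[first + i] != toy_page_table[first] + i)
            return 0;
    }
    return 1;
}
```

In this model, virtual pages 0 through 3 are contiguous from the CPU's point of view, but only pages 2 and 3 are physically adjacent. A device that is handed a single physical base address can therefore use only a region that is contiguous in physical memory.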

With respect to the hardware, OpenCL kernels access shared physical memory through direct connection to the HPS DDR hard memory controller. With respect to the software, support for shared physical memory involves the following considerations:

  1. Typical software implementations for allocating memory on the CPU (for example, the malloc() function) cannot allocate a memory region that the FPGA may use. Memory that the malloc() function allocates is contiguous in the virtual memory address space, but any underlying physical pages are unlikely to be contiguous physically. As such, the host must be able to allocate physically-contiguous memory regions. However, this ability does not exist in user-space applications on Linux. Therefore, the Linux kernel driver must perform the allocation.

  2. The OpenCL SoC FPGA Linux kernel driver includes the mmap() function to allocate shared physical memory and map it into the user space. The mmap() function uses the standard Linux kernel call dma_alloc_coherent() to request physically-contiguous memory regions for sharing with a device.

  3. In the default Linux kernel, dma_alloc_coherent() does not allocate physically-contiguous memory regions larger than 0.5 megabytes (MB). To allow dma_alloc_coherent() to allocate large amounts of physically-contiguous memory, enable the contiguous memory allocator (CMA) feature of the Linux kernel and then recompile the Linux kernel.

    For the Cyclone V SoC Development Kit Reference Platform, CMA manages 512 MB out of 1 GB of physical memory. You may increase or decrease this value, depending on the amount of shared memory that the application requires. The dma_alloc_coherent() call might not be able to allocate the full 512 MB of physically-contiguous memory; however, it can routinely obtain approximately 450 MB of memory.

  4. The CPU can cache memory that the dma_alloc_coherent() call allocates. As a result, write operations from the host application might remain in the CPU cache and not be visible to the OpenCL kernels. To prevent this, the mmap() function in the OpenCL SoC FPGA Linux kernel driver also contains calls to the pgprot_noncached() or remap_pfn_range() function to disable caching for this region of memory explicitly.

  5. After the dma_alloc_coherent() function allocates the physically-contiguous memory, the mmap() function returns the virtual address of the beginning of the range, which is the address span of the memory you allocate. The host application requires this virtual address to access the memory. However, the OpenCL kernels require physical addresses. The Linux kernel driver keeps track of the virtual-to-physical address mapping. You can map the virtual addresses that mmap() returns to actual physical addresses by adding a query to the driver.

    The aocl_mmd_shared_mem_alloc() MMD application programming interface (API) call incorporates the following queries:

    1. The mmap() function that allocates memory and returns the virtual address.
    2. The extra query that maps the returned virtual address to a physical address.

    The aocl_mmd_shared_mem_alloc() MMD API call then returns two addresses: the return value of the call is the virtual address, and the physical address is written to device_ptr_out.

    Note: The driver can only map the virtual addresses that the mmap() function returns to physical addresses. If you request the physical address of any other virtual pointer, the driver returns a NULL value.

Warning: The Intel® FPGA SDK for OpenCL™ runtime libraries assume that the shared memory is the first memory listed in the board_spec.xml file. In other words, the physical address that the Linux kernel driver obtains becomes the Avalon® address that the OpenCL kernel passes to the HPS SDRAM.
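
The allocate-then-query behavior described in consideration 5 can be modeled in user space as follows. This is a minimal sketch under stated assumptions, not the driver or MMD implementation: malloc() stands in for the driver's mmap(), the physical base address 0x20000000 is made up, and the model_ functions and the shared_region structure are hypothetical names. The model recognizes only the exact base addresses it handed out; the text above does not say how the real driver treats pointers into the middle of a region.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_REGIONS 8  /* arbitrary limit for this sketch */

/* One tracked allocation: where the host sees it and where the device sees it. */
struct shared_region {
    void    *virt;  /* address the host application uses */
    uint64_t phys;  /* address the OpenCL kernels use */
    size_t   size;
};

static struct shared_region model_regions[MODEL_MAX_REGIONS];
static size_t   model_region_count = 0;
static uint64_t model_next_phys = 0x20000000u;  /* made-up pool base */

/* Models the combined call: returns the virtual address and writes the
 * matching physical address to *device_ptr_out. Returns NULL on failure. */
static void *model_shared_mem_alloc(size_t size, uint64_t *device_ptr_out)
{
    if (size == 0 || device_ptr_out == NULL ||
        model_region_count == MODEL_MAX_REGIONS)
        return NULL;

    void *virt = malloc(size);  /* stand-in for the driver's mmap() */
    if (virt == NULL)
        return NULL;

    model_regions[model_region_count].virt = virt;
    model_regions[model_region_count].phys = model_next_phys;
    model_regions[model_region_count].size = size;
    model_region_count++;

    *device_ptr_out = model_next_phys;
    model_next_phys += size;
    return virt;
}

/* Models the driver query: only addresses returned by the allocator are
 * known. Any other pointer yields 0, mirroring the NULL result in the Note. */
static uint64_t model_query_phys(const void *virt)
{
    for (size_t i = 0; i < model_region_count; i++) {
        if (model_regions[i].virt == virt)
            return model_regions[i].phys;
    }
    return 0;
}
```

The host application keeps the returned pointer for its own reads and writes, and passes the value stored through device_ptr_out to the kernels. A pointer that did not come from the allocator, such as the address of a local variable, has no entry in the table and so cannot be translated.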

With respect to the runtime library, use the clCreateBuffer() call to allocate the shared memory as a device buffer in the following manner:

  • For the two-DDR board variant with both shared and nonshared memory, clCreateBuffer() allocates shared memory if you specify the CL_MEM_USE_HOST_PTR flag. Using other flags causes clCreateBuffer() to allocate the buffer in nonshared memory.
  • For the one-DDR board variant with only shared memory, clCreateBuffer() allocates shared memory regardless of which flag you specify.
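
The two rules above can be summarized as a small decision function. The sketch below is a model of the described behavior only and does not call the OpenCL runtime. The enum names and the model_buffer_placement() function are hypothetical; the flag values match those in the Khronos CL/cl.h header but are redefined with a MODEL_ prefix so that the example compiles without the SDK.

```c
#include <stdint.h>

/* Same bit values as CL_MEM_READ_WRITE and CL_MEM_USE_HOST_PTR in CL/cl.h,
 * redefined here so the sketch needs no OpenCL headers. */
#define MODEL_CL_MEM_READ_WRITE   (1u << 0)
#define MODEL_CL_MEM_USE_HOST_PTR (1u << 3)

/* Hypothetical names for the two board variants and the two memory kinds. */
enum board_variant { BOARD_ONE_DDR, BOARD_TWO_DDR };
enum memory_kind   { MEM_SHARED, MEM_NONSHARED };

/* Return where a buffer created with the given flags would be placed,
 * following the two rules stated in the list above. */
static enum memory_kind model_buffer_placement(enum board_variant board,
                                               uint64_t flags)
{
    if (board == BOARD_ONE_DDR)
        return MEM_SHARED;  /* only shared memory exists on this variant */

    return (flags & MODEL_CL_MEM_USE_HOST_PTR) ? MEM_SHARED : MEM_NONSHARED;
}
```

For example, on the two-DDR variant a buffer created with MODEL_CL_MEM_READ_WRITE alone lands in nonshared memory, while adding MODEL_CL_MEM_USE_HOST_PTR places it in shared memory.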

Currently, 32-bit Linux support on ARM® CPU governs the extent of shared memory support in the SDK runtime libraries. In other words, runtime libraries compiled for other environments (for example, x86_64 Linux or 64-bit Windows) do not support shared memory.

C5soc did not implement heterogeneous memory to distinguish between shared and nonshared memory for the following reasons:

  1. History—Heterogeneous memory support was not available when shared memory support was originally created.
  2. Uniform interface—Because OpenCL is an open standard, Intel® maintains consistency between heterogeneous computing platform vendors. Therefore, the same interface as other board vendors' architectures is used to allocate and use shared memory.