Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public
Document Table of Contents

7.3.2. Preloading Data to Local Memory

Local memory is considerably smaller than global memory, but it has significantly higher throughput and much lower latency. Unlike global memory accesses, the kernel can access local memory randomly without any performance penalty. When you structure your kernel code, attempt to access the global memory sequentially, and buffer that data in on-chip local memory before your kernel uses the data for calculation purposes.

The implements OpenCL™ local memory in on-chip memory blocks in the FPGA. On-chip memory blocks have two read and write ports, and they can be clocked at an operating frequency that is double the operating frequency of the OpenCL kernels. This doubling of the clock frequency allows the memory to be “double pumped,” resulting in twice the bandwidth from the same memory. As a result, each on-chip memory block supports up to four simultaneous accesses.

Ideally, the accesses to each bank are distributed uniformly across the on-chip memory blocks of the bank. Because only four simultaneous accesses to an on-chip memory block are possible in a single clock cycle, distributing the accesses helps avoid bank contention.

This banking configuration is usually effective; however, the offline compiler must create a complex memory system to accommodate a large number of banks. A large number of banks might complicate the arbitration network and can reduce the overall system performance.

Because the offline compiler implements local memory that resides in on-chip memory blocks in the FPGA, the offline compiler must choose the size of local memory systems at compilation time. The method the offline compiler uses to determine the size of a local memory system depends on the local data types used in your OpenCL code.

Optimizing Local Memory Accesses

To optimize local memory access efficiency, consider the following guidelines:

  • Implementing certain optimizations techniques, such as loop unrolling, might lead to more concurrent memory accesses.
    CAUTION:
    Increasing the number of memory accesses can complicate the memory systems and degrade performance.

  • Simplify the local memory subsystem by limiting the number of unique local memory accesses in your kernel to four or less, whenever possible.

    You achieve maximum local memory performance when there are four or less memory accesses to a local memory system. If the number of accesses to a particular memory system is greater than four, the offline compiler arranges the on-chip memory blocks of the memory system into a banked configuration.

  • If you have function scope local data, the offline compiler statically sizes the local data that you define within a function body at compilation time. You should define local memories by directing the offline compiler to set the memory to the required size, rounded up to the closest value that is a power of two.

  • For pointers to __local kernel arguments, the host assigns their memory sizes dynamically at runtime through clSetKernelArg calls. However, the offline compiler must set these physical memory sizes at compilation time.

    By default, pointers to __local kernel arguments are 16 kB in size. You can specify an allocation size by including the local_mem_size attribute in your pointer declaration.

    For usage information on the local_mem_size attribute, refer to the Specifying Pointer Size in Local Memory section of the Standard Edition Programming Guide.

    Note: clSetKernelArg calls can request a smaller data size than has been physically allocated at compilation time, but never a larger size.

  • When accessing local memory, use the simplest address calculations possible and avoid pointer math operations that are not mandatory.

    recommends this coding style to reduce FPGA resource utilization and increase local memory efficiency by allowing the offline compiler to make better guarantees about access patterns through static code analysis. Complex address calculations and pointer math operations can prevent the offline compiler from creating independent memory systems representing different portions of your data, leading to increased area usage and decreased runtime performance.

  • Avoid storing pointers to memory whenever possible. Stored pointers often prevent static compiler analysis from determining the data sets accessed, when the pointers are subsequently retrieved from memory. Storing pointers to memory almost always leads to suboptimal area and performance results.

  • Create local array elements that are a power of 2 bytes to allow the offline compiler to provide an efficient memory configuration.