With the Intel® FPGA SDK for OpenCL™ Offline Compiler technology, you do not need to change your kernel to fit it optimally into a fixed hardware architecture. Instead, the offline compiler customizes the hardware architecture automatically to accommodate your kernel requirements.
In general, you should first optimize a kernel that targets a single compute unit. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single compute unit), the more hardware compilations you can perform in a given amount of time.
OpenCL Optimization for Intel FPGAs
To optimize your design and reach maximum performance, first determine your theoretical maximum performance and identify your limitations. Follow these steps:
- Start with a simple known good functional implementation.
- Use an emulator to validate the functionality.
- Remove or minimize the pipeline stalls that are reported in the optimization report.
- Plan memory access for optimal memory bandwidth.
- Use a profiler to debug performance issues.
The Profiler gives more insight into system performance, which points you toward further optimizations of how the algorithm uses memory.
Remember that on FPGAs, the more resources you can allocate, the more unrolling and parallelization you can apply, and the higher the performance you can attain.
Helpful Reports and Resources for Optimization
A number of system-generated reports are available. These reports give insight into the code and its resource usage, and they hint at where to focus to further improve performance:
Memory Optimization
Understanding memory systems is crucial to efficiently implement an application using OpenCL.
Global Memory Interconnect
Unlike a GPU, an FPGA can build a custom load-store unit (LSU) that is optimal for your application. As a result, writing OpenCL code that selects the ideal LSU types for your application might significantly improve the performance of your design.
For more information, refer to the Global Memory Interconnect section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Local Memory
Local memory is a complex system. Unlike a typical GPU architecture, which has different levels of caches, an FPGA implements local memory in dedicated memory blocks inside the FPGA. For more information, refer to the Local Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
You can optimize memory usage in a number of ways to improve overall performance. For more information on one of the key techniques, refer to the Allocating Aligned Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
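As a rough illustration of aligned allocation, the host-side sketch below rounds a buffer size up and allocates it on a 64-byte boundary using the C11 aligned_alloc function. The 64-byte value is an assumption based on common DMA requirements, so confirm it against the Allocating Aligned Memory section for your board. The names HOST_BUFFER_ALIGNMENT, alloc_host_buffer, and is_aligned are hypothetical and are not part of the SDK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed alignment in bytes; check the Best Practices Guide for your platform. */
#define HOST_BUFFER_ALIGNMENT 64

/* Hypothetical helper: allocate a zeroed host buffer on an aligned boundary.
 * Returns NULL on failure. Release the buffer with free(). */
void *alloc_host_buffer(size_t bytes)
{
    /* Guard against overflow when rounding the size up. */
    assert(bytes <= SIZE_MAX - HOST_BUFFER_ALIGNMENT);

    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t padded = (bytes + HOST_BUFFER_ALIGNMENT - 1)
                    / HOST_BUFFER_ALIGNMENT * HOST_BUFFER_ALIGNMENT;
    if (padded == 0) {
        padded = HOST_BUFFER_ALIGNMENT;
    }

    void *ptr = aligned_alloc(HOST_BUFFER_ALIGNMENT, padded);
    if (ptr != NULL) {
        memset(ptr, 0, padded);
    }
    return ptr;
}

/* Hypothetical helper: report whether a pointer meets the assumed alignment. */
int is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % HOST_BUFFER_ALIGNMENT) == 0;
}
```

A buffer obtained this way could then be passed to the usual OpenCL host calls that transfer data to and from the device. On toolchains without aligned_alloc, posix_memalign or _aligned_malloc are the usual substitutes.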
For more information on the strategies to improve memory access efficiency, refer to the Strategies for Improving Memory Access Efficiency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Pipelines
Understanding pipelines is crucial for getting the best performance from your implementation. Efficient use of pipelines directly improves throughput. For more details, refer to the Pipelines section of the Intel FPGA SDK for OpenCL Best Practices Guide.
For more information on data transfer, refer to the Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Stall, Occupancy, Bandwidth
Profile your kernel to identify performance bottlenecks. For more information on how profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance, refer to the Profiling Your Kernel to Identify Performance Bottlenecks section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Loop Optimization
Some techniques for optimizing the loops are:
For some tips on removing loop-carried dependencies in various scenarios for a single work item kernel, refer to the Removing Loop-Carried Dependency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
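One commonly described way to relax a loop-carried dependency is to spread a running sum across a small shift register of partial sums. Each addition then depends on a value produced several iterations earlier, not on the immediately preceding one. The sketch below shows the pattern in plain C rather than in an OpenCL kernel. ACC_DEPTH, sum_naive, and sum_shift_register are hypothetical names, and the depth of 8 is an arbitrary assumption, not a value taken from the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed register depth; in practice it would be chosen to cover the
 * latency of the floating-point adder reported by the compiler. */
#define ACC_DEPTH 8

/* Baseline: every iteration depends on the sum from the previous iteration. */
float sum_naive(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Shift-register version: each new partial sum reads a value written
 * ACC_DEPTH iterations earlier, which shortens the dependency chain. */
float sum_shift_register(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float shift_reg[ACC_DEPTH + 1];
    for (size_t j = 0; j <= ACC_DEPTH; j++) {
        shift_reg[j] = 0.0f;
    }

    for (size_t i = 0; i < n; i++) {
        /* Add the new element to the oldest partial sum. */
        shift_reg[ACC_DEPTH] = shift_reg[0] + data[i];
        /* Shift every partial sum one position toward index 0. */
        for (size_t j = 0; j < ACC_DEPTH; j++) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }

    /* Combine the partial sums once, after the main loop. */
    float total = 0.0f;
    for (size_t j = 0; j < ACC_DEPTH; j++) {
        total += shift_reg[j];
    }
    return total;
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs. They agree exactly when all intermediate sums are exactly representable, as with small integer values.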
For more information on optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Area Optimization
Area usage is an important design consideration if your OpenCL kernels are executable on FPGAs of different sizes. When you design your OpenCL application, Intel recommends that you follow certain design strategies for optimizing hardware area usage.
Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in decreased performance. During kernel optimization, Intel recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off.
For more information on strategies for optimizing FPGA area usage, refer to the Strategies for Optimizing FPGA Area Usage section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Reference Design Examples
Some design examples that illustrate the optimization techniques are as follows:
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation.
This example illustrates:
- Single-precision floating-point optimizations
- Local memory buffering
- Compile optimizations (loop unrolling, num_simd_work_items attribute)
- Floating-point optimizations
- Multiple device execution
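To make the loop tiling idea concrete, here is a minimal plain C sketch, not the reference design itself. It copies one tile of each input matrix into small local arrays, which stand in for local memory buffers, and reuses each loaded value TILE times before moving on. TILE, matmul_naive, and matmul_tiled are hypothetical names, and the sketch assumes square row-major matrices whose size is a multiple of TILE.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* assumed tile edge; a real design tunes this to the device */

/* Reference version: every product reads both operands from the large arrays. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Tiled version: each TILE x TILE block of a and b is loaded once into a
 * small buffer and then reused for TILE * TILE * TILE multiply-adds. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(n % TILE == 0);

    for (size_t bi = 0; bi < n; bi += TILE) {
        for (size_t bj = 0; bj < n; bj += TILE) {
            float acc[TILE][TILE] = {{0.0f}};

            for (size_t bk = 0; bk < n; bk += TILE) {
                float a_tile[TILE][TILE];
                float b_tile[TILE][TILE];

                /* Load one tile of each input (stand-in for local memory). */
                for (size_t i = 0; i < TILE; i++) {
                    for (size_t j = 0; j < TILE; j++) {
                        a_tile[i][j] = a[(bi + i) * n + (bk + j)];
                        b_tile[i][j] = b[(bk + i) * n + (bj + j)];
                    }
                }

                /* Multiply the two tiles and accumulate into the output tile. */
                for (size_t i = 0; i < TILE; i++) {
                    for (size_t j = 0; j < TILE; j++) {
                        for (size_t k = 0; k < TILE; k++) {
                            acc[i][j] += a_tile[i][k] * b_tile[k][j];
                        }
                    }
                }
            }

            /* Write the finished output tile back. */
            for (size_t i = 0; i < TILE; i++) {
                for (size_t j = 0; j < TILE; j++) {
                    c[(bi + i) * n + (bj + j)] = acc[i][j];
                }
            }
        }
    }
}
```

In an OpenCL kernel, the tile buffers would typically be declared in local memory and the inner loops unrolled. Those attributes are left out here so the sketch stays standard C.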
This design example implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite. For more information, refer to the Time-Domain Finite Impulse Response Filter Bank page.
This design is a great example of how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters.
This example illustrates:
- Single-precision floating-point optimizations
- Efficient 1D sliding window buffer implementation
- Single work-item kernel optimization methods
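The 1D sliding window pattern can be sketched in plain C as follows: a small array holds the most recent samples, shifts by one position per input, and feeds every tap of the filter in the same iteration. This is a simplified real-valued illustration under assumed names (NUM_TAPS, fir_sliding_window), not the benchmark implementation.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 4  /* assumed filter length, for illustration only */

/* Computes output[i] = sum over t of coeffs[t] * input[i - t], treating
 * samples before the start of the input as zero. */
void fir_sliding_window(const float *input, const float *coeffs,
                        float *output, size_t n)
{
    assert(input != NULL && coeffs != NULL && output != NULL);

    /* The window holds the NUM_TAPS most recent samples, newest at index 0. */
    float window[NUM_TAPS] = {0.0f};

    for (size_t i = 0; i < n; i++) {
        /* Shift the window by one sample and insert the newest input. */
        for (size_t t = NUM_TAPS - 1; t > 0; t--) {
            window[t] = window[t - 1];
        }
        window[0] = input[i];

        /* All taps read from the window, so each input sample is
         * fetched from the input array only once. */
        float acc = 0.0f;
        for (size_t t = 0; t < NUM_TAPS; t++) {
            acc += coeffs[t] * window[t];
        }
        output[i] = acc;
    }
}
```

In a single work-item kernel, fixed-size loops like the two inner ones are usually good candidates for full unrolling, so that the window can be implemented as a shift register instead of a memory.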
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory.
This example illustrates:
- Kernel channels
- Multiple simultaneous kernels
- Kernel-to-kernel channels
- Sliding window design pattern
- Memory access pattern optimizations
This design example is an OpenCL implementation of the Lucas Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone® V SoC Development Kit.
This example illustrates:
- Single work-item kernel
- Sliding window design pattern
- Resource usage reduction techniques
- Visual output
Training
Online training specific to OpenCL optimization, with design examples, is available at:
References