8.7. Discrepancies in Hardware and Emulator Results
When you emulate a kernel, your OpenCL system might produce results that differ from those of the same kernel compiled for hardware. Before you compile for hardware, you can debug your kernel further by running it through simulation.
Warning: These discrepancies usually occur when the Intel FPGA SDK for OpenCL Emulator cannot accurately model some aspect of the hardware computation, or when your program relies on undefined behavior.
The most common reasons for differences in emulator and hardware results are as follows:
- Your OpenCL kernel code uses the #pragma ivdep directive. The Emulator does not model the behavior of your OpenCL system when a #pragma ivdep directive breaks a true dependence. After a full hardware compilation, you observe this as an incorrect result when you run the kernel.
- Your OpenCL kernel code is relying on uninitialized data. Examples of uninitialized data include uninitialized variables and uninitialized or partially initialized global buffers, local arrays, and private arrays.
- Your OpenCL kernel code behavior depends on the precise results of floating-point operations. The Emulator uses the floating-point hardware of the CPU, whereas the hardware run uses floating-point cores implemented in the FPGA. Compiling your OpenCL kernel with the -ffp-reassociate aoc option might change the order of operations, leading to further divergence in the floating-point results.
Note: The OpenCL standard allows one or more least significant bits of floating point computations to differ between platforms, while still being considered correct on both such platforms.
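The effect of reordering floating-point operations can be reproduced on a CPU. The following is a minimal sketch in plain C, not output of the offline compiler; the function names sum_left and sum_right are hypothetical and chosen only for illustration. It adds the same three single-precision values in two different association orders and obtains two different results, which is the kind of divergence that reassociation can introduce.

```c
#include <stdio.h>

/* Adds three floats in source order: (a + b) + c. */
static float sum_left(float a, float b, float c)
{
    volatile float t = a + b;   /* volatile forces the intermediate to be rounded to float */
    return t + c;
}

/* Adds the same three floats in a reassociated order: a + (b + c). */
static float sum_right(float a, float b, float c)
{
    volatile float t = b + c;   /* 1.0f is lost here when b is large in magnitude */
    return a + t;
}

/* Prints both results for one set of inputs where the order matters. */
static void show_reassociation(void)
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %g\n", sum_left(a, b, c));
    printf("a + (b + c) = %g\n", sum_right(a, b, c));
}
```

With these inputs, the first order yields 1 and the second yields 0, because adding 1.0f to -1e8f is below the resolution of a single-precision value of that magnitude. A kernel whose control flow or final output depends on such a value can therefore behave differently depending on the order in which the operations are performed.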
- Your OpenCL kernel code behavior depends on the order of channel accesses in different kernels. The emulation of channel behavior has limitations, especially for conditional channel operations where the kernel does not call the channel operation in every loop iteration. In such cases, the Emulator might execute channel operations in an order different from that on the hardware.
- Your OpenCL kernel or host code is accessing global memory buffers out-of-bounds.
- Uninitialized memory read and write behaviors are platform-dependent. Verify the sizes of your global memory buffers wherever they are used: when addressing them within kernels, when allocating them with the clCreateBuffer function call, and when transferring data with the clEnqueueReadBuffer and clEnqueueWriteBuffer function calls.
- You can use software memory leak detection tools, such as Valgrind, on the emulated version of your OpenCL system to analyze memory related problems. Absence of warnings from such tools does not mean the absence of problems. It only means that the tool could not detect any problem. In such a scenario, Intel recommends manual verification of your OpenCL kernel or host code.
- Your OpenCL kernel code is accessing local or private variables out-of-bounds. For example, accessing a local or private array out-of-bounds or accessing a private variable after it has gone out of scope.
Attention: In software terms, these issues are referred to as stack corruption issues because accessing variables out-of-bounds usually affects unrelated variables located close to the variable being accessed on a software stack. Emulated OpenCL kernels are implemented as regular CPU functions, and have an actual stack that can be corrupted. When targeting hardware, no stack exists and hence, the stack corruption issues are guaranteed to manifest differently. You may use memory leak analyzer tools, such as Valgrind, when a stack corruption is suspected. However, stack related issues are usually difficult to identify. Intel recommends manual verification of your OpenCL kernel code to debug a stack related issue.
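One way to support the manual verification recommended above is to route every index range through a single check before it is used. The following is a minimal sketch in plain C under stated assumptions: range_fits and guarded_fill are hypothetical helpers, not part of the SDK, and LOCAL_LEN stands in for the declared size of a local or private array. The same comparison can be applied on the host to an offset and size before a buffer transfer.

```c
#include <stddef.h>

#define LOCAL_LEN 16   /* assumed array size, for illustration only */

/* Hypothetical helper: returns 1 when the element range
   [offset, offset + count) lies inside a buffer of `capacity` elements,
   and 0 otherwise. The sum offset + count is never computed, so the
   check itself cannot overflow. */
static int range_fits(size_t offset, size_t count, size_t capacity)
{
    return offset <= capacity && count <= capacity - offset;
}

/* Writes `value` into dst[offset .. offset + count - 1] only when the
   whole range is valid. Returns the number of elements written. */
static size_t guarded_fill(int dst[LOCAL_LEN], size_t offset, size_t count, int value)
{
    if (!range_fits(offset, count, LOCAL_LEN))
        return 0;   /* reject the access instead of touching neighboring data */
    for (size_t i = 0; i < count; ++i)
        dst[offset + i] = value;
    return count;
}
```

A check like this makes an out-of-bounds access fail in the same visible way in every environment, rather than depending on what happens to be stored next to the array.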
- Your OpenCL kernel code uses shift amounts that are larger than the bit width of the type being shifted. For example, shifting a 64-bit integer by 65 bits. According to the OpenCL specification version 1.0, the behavior of such shifts is undefined.
- When you compile your OpenCL kernel for emulation, the default channel depth is different from the default channel depth generated when your kernel is compiled for hardware. This difference in channel depths might lead to scenarios where execution on the hardware hangs while kernel emulation works without any issue. Refer to Emulating Channel Depth for information on how to fix the channel depth difference.
- The lines printed by the printf function might appear in a different order on the Emulator and on hardware. This is because, in the hardware, printf data is stored in a global memory buffer and flushed from the buffer only when the kernel execution is complete, or when the buffer is full. In the Emulator, the printf function uses the x86 stdout.
- If you perform an unaligned load/store through upcasting of types, the FPGA and emulator might produce different results. A load/store of this type is undefined in the C99 specification.
For example, the following operation might produce unexpected results:
int tmp = *((int *) (my_ptr + 5));
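A 32-bit value at an arbitrary byte offset can be read without any unaligned access by assembling it from individual bytes. The following is a minimal sketch in plain C. It assumes that my_ptr in the example above is a byte pointer and that the data is stored in little-endian byte order; neither is stated in the original example. The helper name load_u32_bytewise is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: reads a 32-bit little-endian value that starts at
   any byte address. Each access is a single-byte load, so no unaligned
   int load is ever performed. */
static uint32_t load_u32_bytewise(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Under these assumptions, the earlier example could be written as int tmp = (int) load_u32_bytewise(my_ptr + 5);. Because the helper uses only byte loads, shifts, and bitwise OR, its result does not depend on how a given platform handles unaligned accesses.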