Intel® FPGA SDK for OpenCL Support

Title Details

Intel® FPGA SDK for OpenCL™ Version 18.1 Release Notes

The Intel® FPGA SDK for OpenCL™ Pro Edition Release Notes provides late-breaking information about the Intel FPGA Software Development Kit (SDK) for OpenCL™ Pro Edition and the Intel FPGA Runtime Environment (RTE) for OpenCL Pro Edition Version 18.1.

Intel FPGA SDK for OpenCL Getting Started Guide

This guide describes the procedures you follow to install the Intel FPGA SDK for OpenCL. This document also contains instructions on how to compile an example OpenCL application with the Intel FPGA SDK for OpenCL.

Intel FPGA RTE for OpenCL Getting Started User Guide

This guide describes the procedures you follow to install the Runtime Environment (RTE) for OpenCL. This document also contains instructions on how to deploy an OpenCL application with the RTE.

Intel FPGA SDK for OpenCL Programming Guide

This guide provides descriptions, recommendations, and usage information on the Intel FPGA SDK for OpenCL compiler and tools. The Intel FPGA SDK for OpenCL is an OpenCL-based heterogeneous parallel programming environment for Intel FPGAs.

Intel FPGA SDK for OpenCL Best Practices Guide

This guide provides guidance on leveraging the functionalities of the Intel FPGA SDK for OpenCL to optimize your OpenCL applications for Intel FPGAs.

Intel FPGA SDK for OpenCL Cyclone® V SoC Getting Started Guide

This guide describes the procedures you follow to set up and use the Intel FPGA SDK for OpenCL to run an OpenCL application on the Cyclone V SoC Development Kit.

Intel FPGA SDK for OpenCL Custom Platform Toolkit User Guide

This guide outlines the procedure for creating an Intel FPGA SDK for OpenCL Custom Platform.

Intel FPGA SDK for OpenCL Stratix® V Network Reference Platform Porting Guide

This guide describes the procedures and design considerations you can implement to modify the Stratix V Network Reference Platform (s5_net) into your own custom platform for use with the Intel FPGA SDK for OpenCL. This document also contains reference information on the design decisions for s5_net, which makes use of features such as heterogeneous memory buffers and I/O channels to maximize hardware usage on a computing card designed for networking.

Intel FPGA SDK for OpenCL Cyclone V SoC Development Kit Reference Platform Porting Guide

This guide describes the hardware and software design of the Cyclone V SoC Development Kit Reference Platform (c5soc) for use with the Intel FPGA SDK for OpenCL.

Intel FPGA SDK for OpenCL Intel Arria® 10 GX FPGA Development Kit Reference Platform Porting Guide

This guide describes the hardware and software design of the Intel Arria® 10 GX FPGA Development Kit Reference Platform (a10_ref) for use with the Intel FPGA SDK for OpenCL.

Intel FPGA SDK for OpenCL Intel Arria 10 SoC Development Kit Reference Platform Porting Guide

This guide describes the hardware and software design of the Intel Arria 10 SoC Development Kit Reference Platform for use with the Intel FPGA SDK for OpenCL.

Intel FPGA SDK for OpenCL Intel Stratix 10 GX Development Kit Reference Platform Porting Guide

This guide describes the hardware and software design of the Intel Stratix 10 GX Development Kit Reference Platform for use with the Intel FPGA SDK for OpenCL.

Title Details

In-Line Acceleration for Streaming Analytics

This paper presents a method for implementing FPGA in-line acceleration for streaming analytics.

Accelerating Genomics Research with OpenCL™ and FPGAs (PDF)

This paper describes the acceleration of the GATK’s HaplotypeCaller algorithm using Intel FPGAs programmed with Intel FPGA SDK for OpenCL.

OpenCL on FPGAs for GPU Programmers

This paper highlights the benefits of using Intel FPGAs and the differences between FPGAs and GPUs in executing and optimizing OpenCL kernels.

FPGA Acceleration of Multifunction Printer Image Processing Using OpenCL (PDF)

This paper explores the application of OpenCL to the core Multifunction Printer image processing pipeline with Intel SoC FPGAs.

Implementing FPGA Design with the OpenCL Standard (PDF)

This paper highlights the benefits of utilizing OpenCL with Intel FPGAs over other hardware architectures and traditional methods of FPGA development.

Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms (PDF)

This paper presents a real-time implementation of a fractal compression algorithm in OpenCL. It shows how the algorithm can be efficiently implemented in OpenCL and optimized for multi-core CPUs, GPUs, and FPGAs.

Using OpenCL to Evaluate the Efficiency of CPUs, GPUs, and FPGAs for Information Filtering (PDF)

This paper explores techniques that allow programmers to efficiently use FPGAs at a level of abstraction that is closer to traditional software-centric approaches by using OpenCL.

40 Gb AES Encryption Using OpenCL and FPGAs (PDF)

This application note illustrates how to perform AES encryption on FPGAs using the OpenCL tool flow.

Is Intel® FPGA SDK for OpenCL Ready for Business? (PDF)

This paper examines the performance and ease of use of OpenCL with Intel FPGAs for valuing a wide range of financial derivative products with European exercise properties using the Monte Carlo technique.

OpenCL-Ready High-Performance FPGA Network for Reconfigurable HPC

This paper proposes high-performance inter-FPGA Ethernet communication using mixed OpenCL and Verilog HDL programming to demonstrate the feasibility of offloading computation on the fly while performing low-latency data movement.

Optical Flow and Pedestrian Detection Implemented with OpenCL™

Accelerating Algorithm Performance with OpenCL by Offloading to an FPGA

Object Detection and Recognition with Neural Networks

Ray Tracing Demo using OpenCL on an SoC

Unified Heterogeneous Programmability of OpenCL

Design Examples

The following examples demonstrate how to describe various applications in OpenCL along with their respective host applications, which you can compile and execute on a host with an FPGA board that supports the Intel® FPGA SDK for OpenCL™.

Basic Examples

Design Example Features Benefits Description

Hello World

  • OpenCL™ application programming interface (API) to initialize a device and run a kernel
  • Getting started
This simple design example demonstrates a basic OpenCL kernel containing a printf call and its corresponding host program.

Vector Addition

  • OpenCL API
  • Partition a large problem across multiple devices
  • OpenCL events and event profiling
  • Getting started
This simple design example demonstrates a basic vector addition OpenCL kernel and its corresponding host program.
Multithread Vector Operation
  • Multithreaded host
  • Advanced host code
Two host threads launch two simultaneous kernels.
OpenCL Library
  • OpenCL library
  • Advanced kernel code
Example designs that use OpenCL libraries containing Verilog and VHDL code to implement custom functions.
Loopback - Host Pipe
  • Host Pipes
  • Single Work-item Kernel
  • Host-Kernel communication
This example design demonstrates communication between the host and the kernel. It loops data from the host to the kernel and back to the host.
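
The Vector Addition example above lists partitioning a large problem across multiple devices as one of its features. The following is a minimal plain-Python sketch of that idea only, not code from the design example: the function names `partition` and `vector_add_partitioned` are hypothetical, and each chunk simply stands in for the work that one device's kernel launch would perform.

```python
def partition(n, num_devices):
    """Split range(n) into num_devices contiguous (start, end) chunks.

    Chunk sizes differ by at most one element, so the load stays balanced
    even when n is not a multiple of num_devices.
    """
    base, extra = divmod(n, num_devices)
    chunks = []
    start = 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def vector_add_partitioned(a, b, num_devices):
    """Add two equal-length vectors chunk by chunk (one chunk per 'device')."""
    out = [0] * len(a)
    for start, end in partition(len(a), num_devices):
        # In a real host program, this inner loop would be one kernel launch
        # on one device, operating on its own slice of the buffers.
        for i in range(start, end):
            out[i] = a[i] + b[i]
    return out
```

In an actual OpenCL host program, each chunk would typically map to its own command queue and buffer region; that detail is omitted here.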

High-Performance Computing Platform Examples

Design Example Features Benefits Description
Channelizer
  • Kernel channels
  • Multiple simultaneous kernels
  • Single work-item kernels
  • Performance
  • Getting started with kernel channels
This design example demonstrates a high-performance channelizer design using OpenCL™. The channelizer combines a polyphase filter bank (PFB) with a fast Fourier transform (FFT) to reduce the effects of spectral leakage on the resulting frequency spectrum.
Document Filtering
  • Working with 24-bit integers
  • Performance
This design example demonstrates the use of a Bloom filter for high-performance document filtering.
Finite Difference Computation (3D)

  • Single-precision floating-point optimizations
  • Single work-item kernel
  • Optimizations to minimize redundant memory use
  • Performance
This design example demonstrates a high-performance 3D finite-difference stencil-only computation using OpenCL. It shows how to efficiently describe a sliding window data reuse pattern.
FFT (1D)
  • Single-precision floating-point optimizations
  • Single work-item kernel
  • Performance
This design example demonstrates a high-performance 1D radix-4 complex FFT or inverse fast Fourier transform (IFFT) engine using OpenCL. This example takes advantage of the efficient sliding window data reuse pattern.
FFT Off-Chip (1D)

  • Single-precision floating-point optimizations
  • Kernel channels
  • Optimized memory accesses
  • Performance
  • Getting started with kernel channels
This design example is a high-performance implementation of a one million point FFT. Such large FFTs cannot be done completely on the FPGA and this example demonstrates how to efficiently manage the memory accesses.
FFT (2D)
  • Single-precision floating-point optimizations
  • Kernel channels
  • Memory access pattern optimizations
  • Multiple simultaneous kernels
  • Mix of single work-item and NDRange kernels
  • Performance
  • Getting started with kernel channels
This design example demonstrates a high-performance 2D radix-4 complex FFT/IFFT engine using OpenCL. This engine is targeted at large problem sizes (1024x1024 by default) and uses global memory to store the intermediate transposition. One aspect highlighted by this example is how to efficiently perform matrix transposition in global memory.
Gzip Compression
  • Single work-item kernel
  • Stream-like processing
  • Published paper with implementation and results
  • High performance (vs. CPU, RTL, ASIC)
  • Parameterizable performance and compression quality
This design example showcases a high-performance Gzip compression implementation using OpenCL for Intel® FPGAs.
JPEG Decoder
  • Single work-item kernels
  • Kernel channels
  • Overlapping memory transfers and kernel invocations
  • Visual output
  • Scalable Performance
  • Getting started with kernel channels
This design example showcases a high-performance JPEG decoding solution.
Mandelbrot Fractal Rendering
  • Double-precision floating-point optimizations
  • Multiple device partitioning
  • Visual output
  • Scalable Performance
This design example includes a kernel that implements the Mandelbrot fractal convergence algorithm and displays the results to the screen.
Matrix Multiplication
  • Single-precision floating-point optimizations
  • Local memory buffering
  • Compiler optimizations
  • Multiple device execution
  • Scalable performance
  • Getting started with optimization methods
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation.
Monte Carlo Black-Scholes Asian Options Pricing
  • Double-precision floating-point optimizations
  • Kernel channels
  • Multiple device execution
  • Multiple simultaneous kernels
  • Scalable
  • Power-efficient performance
  • Getting started with kernel channels
This design example implements the Monte Carlo Black-Scholes simulation for Asian option pricing. This example shows how to run multiple kernels simultaneously, with each performing different parts of the simulation (random number generation, path simulation, and accumulation) and communicating using our channels vendor extension.
Sobel Filter
  • Integer arithmetic
  • Single work-item kernel
  • Efficient 2D sliding window line buffer
  • Visual output
  • Scalable performance
This design example demonstrates a seamless software solution of a Sobel filter in OpenCL to perform edge detection on an image and display the resulting filtered image on the screen.
Time-Domain FIR Filter
  • Single-precision floating-point optimizations
  • Efficient 1D sliding window buffer implementation
  • Single work-item kernel
  • Optimization methods
  • Performance
  • Getting started with optimization methods
This design implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite. It illustrates how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters.
Video Downscaling
  • Kernel channels
  • Multiple simultaneous kernels
  • Memory access pattern optimizations
  • Performance
  • Getting started with kernel channels
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory.
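
Several of the examples above (Finite Difference Computation, Sobel Filter, and Time-Domain FIR Filter) mention a sliding-window data reuse pattern. The following is a minimal plain-Python sketch of a 1D sliding window applied to a FIR filter, written under the assumption that the window behaves like a fixed-length shift register. It is not taken from the design examples, and the name `fir_sliding_window` is hypothetical.

```python
from collections import deque


def fir_sliding_window(samples, taps):
    """Time-domain FIR filter using a fixed-size sliding window.

    The window always holds the most recent len(taps) samples, newest first.
    Each input sample is read once and then reused for len(taps) outputs,
    which is the data reuse the sliding-window pattern is meant to capture.
    """
    # Start with an all-zero window, as if the input were preceded by silence.
    window = deque([0.0] * len(taps), maxlen=len(taps))
    out = []
    for x in samples:
        # Shift in the new sample; the oldest sample falls off the far end.
        window.appendleft(x)
        out.append(sum(c * s for c, s in zip(taps, window)))
    return out
```

For instance, feeding a unit impulse through the filter returns the tap values in order, which is a quick sanity check for any FIR implementation.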

Network Platform Examples

Design Example Features Benefits Description
OPRA FAST Parser
  • Single work-item kernel
  • I/O channels
  • Low latency
  • 10G link saturation
This design example demonstrates a streaming parser commonly used in high-frequency trading algorithms. The parser accepts an OPRA FAST data stream and decompresses the fields for use upstream. It illustrates how you can process streaming messages efficiently to achieve 10G link saturation.

Cyclone® SoC Platform Examples

Design Example Features Benefits Description
Multifunction Printer Error Diffusion
  • Single work-item kernel
  • Sliding window design pattern
  • Part of a Multifunction Printer system
  • Performance
This design is part of the core printer pipeline. It implements a variant of the Floyd-Steinberg error diffusion algorithm. The kernel takes a CMYK image and produces an equivalent image with every pixel half-toned. Such an output is the final stage of image processing inside a printer before it is sent to the laser system. The FPGA Acceleration of Multifunction Printer Image Processing Using OpenCL™ white paper is also available for this example.
Optical Flow
  • Single work-item kernel
  • Sliding window design pattern
  • Resource usage reduction techniques
  • Visual output
  • Performance
This design example is an OpenCL implementation of the Lucas-Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone® V SoC Development Kit.
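
The Multifunction Printer Error Diffusion example above is described as a variant of Floyd-Steinberg error diffusion. The sketch below shows only the textbook form of the algorithm on a single 8-bit channel, in plain Python. The exact variant used by the design example is not described here, so the single-channel input, the threshold of 128, and the name `floyd_steinberg` are all assumptions made for illustration.

```python
def floyd_steinberg(image, threshold=128):
    """Half-tone one 8-bit channel with textbook Floyd-Steinberg diffusion.

    image is a list of rows of values in 0..255. Each output pixel is either
    0 or 255, and the quantization error is pushed to unvisited neighbours
    with weights 7/16 (right), 3/16 (below-left), 5/16 (below), 1/16
    (below-right).
    """
    height = len(image)
    width = len(image[0])
    # Work on a float copy so accumulated error does not modify the input.
    work = [[float(v) for v in row] for row in image]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

A CMYK image could be handled by running the same routine once per channel, although a production pipeline may treat the channels differently.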

To get you up and running quickly, we offer a number of platforms (as listed below) created both in-house and by our design partners that support the Intel FPGA SDK for OpenCL.

Boards Application Area Features Provider
Intel® Programmable Acceleration Card with Intel Arria® 10 GX FPGA High Performance Computing
  • Intel Arria 10 GX FPGA
  • 10AX115N2F40E2LG
  • High-performance, multi-gigabit SERDES transceivers up to 15 Gbps
  • 1,150 K logic elements available (-2L speed grade)
  • 53 Mb of embedded memory
  • 8 GB DDR4 memory banks with error correction code (ECC) (2 banks)
  • 1 Gb (128 MB) flash memory
  • PCIe x8 Gen3 electrical, x16 mechanical 
  • USB 2.0 interface for debug and programming of FPGA and flash memory
  • 1X QSFP+ with 4X 10GbE or 40GbE support
  • Standard height, 1/2 length
    • Low-profile option on request
Intel
S5PH-Q PCIe Board Networking
  • High-density Altera Stratix V GX/GS FPGA
  • PCIe x8 interface supporting Gen1, Gen2, or Gen3
  • Dual QSFP+ cages for 40GigE or 10GigE direct to the FPGA for lowest possible latency
  • Up to 16 GB DDR3 SDRAM
  • Up to 72 MB QDRII/II+
  • Two SATA connectors
  • Timestamping support
  • Board Management Controller for Intelligent Platform Management
  • Utility I/O includes: USB 2.0, RS-232 and JTAG
BittWare

A10PL4 PCIe Board
High Performance Computing
  • Altera Arria 10 GX FPGA
  • PCIe x8 interface supporting Gen1, Gen2, or Gen3
  • Dual QSFP cages for 2x 40GbE or 8x 10GbE
  • Memory: up to 32 GB of DDR4 SDRAM with ECC (x72)
  • Board Management Controller for Intelligent Platform Management
  • Precision clock and timing options
  • Utility I/O: USB 2.0

 

BittWare
WARP II High Performance Computing
  • Dual Intel Stratix 10 FPGAs
  • Up to 272 GB DDR4 (136 GB per FPGA)
  • 20 TFLOPs (10 TFLOPs/FPGA)
  • PCIe x16 Gen3
  • 2x QSFP + 40/100GbE
  • L tile: 40GbE operating at 10 Gbps backplane performance
  • H tile: 100GbE operating at 28 Gbps backplane performance
  • Intel MAX 10 FPGA
  • NXP Semiconductors* K61 microcontroller
  • GPU-sized PCIe form factor
Colorado Engineering
Proc10S High Performance Computing
  • Stratix 10 GX/SX FPGA
  • For SX devices, Quad-core 64-bit Arm* Cortex-A53* MPCore processor
  • PCIe x16 Gen3 or stand-alone
  • Up to 10x 26 Gbps + 8x 17.4 Gbps reconfigurable transceivers (total of 400 Gbps)
  • Form factor: Full-height, double-width, ¾ length PCI Express card
  • Supports up to 12 V/300 W
  • 2x QSFP28, 2x SFP28, and Gidel high-speed connectors
  • Multilevel memory structure (260+ GB):
    • Enhanced MLAB (640b) SRAM (15 Mb)
    • Up to 11,721 M20K (20 Kb) SRAM (229 Mb) at up to 32 Gbps/block
    • 4 GB DDR4 SDRAM onboard memory at a maximum sustained throughput of 108 Gbps
    • 256 GB DDR4 SDRAM (2x RDIMM banks) for maximum sustained throughput of 400 Gbps
    • Configuration Flash, Serial Flash (SPI), and Serial EEPROM
  • Supported by Gidel’s OpenCL BSP and HLS (I++) ASP based on Intel’s SDK
Gidel

520N Network Acceleration Card

Networking
  •  Intel Stratix 10 F1760 NF43 package
  • 16-lane PCI Express Gen3
  • DDR4 SDRAM Memory
    • Four banks of DDR4 SDRAM x 72 bits, 8 GB per bank (32 GB total; 64 GB version also available)
    • Transfer Rate: 2400 MT/s
  • Four 100/40/25/10G QSFP28 Network Ports
    • L-TILE: Supports up to two 100G network ports
    • H-TILE: Supports up to four 100G network ports
    • Network recovered clocking supported
Nallatech

Attila Instant-DevKit Arria 10 GX FMC PCIe board
High Performance Computing
  • FPGA Arria 10 GX 10AX115N4F40I3SG (1150K LEs in F40 package FBGA / production device)
  • Onboard JTAG configuration circuitry to enable configuration over onboard USB blaster
  • JTAG header provided to program the MAX V and to access the Arria 10
  • Fast passive parallel (FPP) configuration via MAX®10 device and flash memory
  • Active Serial (AS) configuration via 128 MB (1024 Mb) Quad serial SPI NOR Flash Memory
  • AS configuration for CvP support
  • DDR4 SODIMM interface (up to 16 GB, speeds up to 1200 MHz/2400 Mbps, 72b width supports ECC and Non ECC)
  • 128 MB (1024 Mb) Quad SPI NOR Flash Memory
  • 32Kb I2C EEPROM
  • 1 x QSFP+ optical cage (4 XCVR: 12.5 Gbps per link)*
  • One serial over USB Bridge High-speed link (through USB Hub) on front µUSB connector
  • 1 x PCIe edge connector for Gen3 x8 (32 Gbps)
REFLEX CES
DE5-Net FPGA Development Kit High Performance Computing
  • Altera Stratix® V GX FPGA (5SGXEA7N2F45C2)
  • On-Board USB Blaster II or JTAG header for FPGA programming
  • Fast passive parallel (FPP x32) configuration via MAX II CPLD and flash memory
  • Two Independent DDR3 SODIMM Sockets, Up to 8 GB, 800 MHz or 4 GB, 933 MHz for each socket
  • Four Independent 550 MHz SRAM, 18-bit data bus, and 72 Mb for each
  • 256 MB Flash Memory
  • Four SFP+ connectors
  • PCI Express (PCIe) x8 edge connector (includes Windows PCIe drivers)
  • One RS-422 expansion header

While it is convenient if the architecture of the FPGA accelerator you want falls into one of these existing categories, it is not required. These reference platforms are a starting point to aid in building your own custom FPGA platform. Start with the existing SoC or network platform, remove or modify the component interfaces to match the ones you prefer, and rebuild it. This uses traditional FPGA design to create the I/O ring through which the OpenCL™ kernels communicate with the I/O interfaces on your custom board.

To build your own custom FPGA accelerator board, you will need a few things. To start building a custom board support package from a blank template, start with the custom platform toolkit.

Documentation

Custom Platform Toolkit: Windows* or Linux* downloads

  • Raw template for a platform
  • Board-test kernels to exercise the I/O interfaces
  • MMD header file to get started on building drivers
  • HPC platform migration text file (from version 13.1)

To start with an existing platform and modify it, here are the current reference platforms available.

Intel Stratix® 10 GX FPGA Development Kit Board Support Package Reference Design

Intel Arria® 10 GX FPGA Development Kit Board Support Package Reference Design

Stratix V Network Board Support Package Reference Design: s5_net (w/ PLDA UDP stack):

Cyclone® V SoC Board Support Package Reference Design:

Intel Arria 10 Custom Platform for OpenCL

Need Help?

Intel recommends the following certified OpenCL board support service providers that can assist you in the development of an OpenCL board support package (BSP) for your custom platforms:

Optimization Training

 

OpenCL™ Coding Optimizations for Intel® Stratix® 10 Devices (23 minutes)

In this course, we will cover how the offline kernel compiler of the Intel® FPGA SDK for OpenCL™ optimizes OpenCL kernel code for optimal performance on Intel® Stratix® 10 FPGAs and how to use recommended coding constructs to enable these optimizations. 

 

OpenCL™ Optimization Techniques: Secure Hash Algorithm (SHA-1) Example (7 minutes)
This training provides a simple overview of the optimization methodology one would take when trying to optimize their OpenCL implementation for an FPGA using the Secure Hash Algorithm (SHA-1) as an example.
 

OpenCL Optimization Techniques: Image Processing Algorithm Example (8 minutes)
This training provides a simple overview of an architectural optimization approach for targeting OpenCL on an FPGA for image processing algorithms.
 

Single-Threaded vs. Multi-Threaded Kernels (17 minutes)
Understand the differences between loop pipelining and parallel threads, and know when to use single-threaded (Task) and multithreaded (NDRange) pipelining.
 

Optimization and Emulation Flow in Intel® FPGA for OpenCL (6 minutes)
See how you can optimize your FPGA-accelerated applications with the emulator and detailed optimization report features.
 

How to Do Reductions (PDF)
 

Being Careful with Memory Access Part 1 (PDF)
 

Being Careful with Memory Access Part 2 (PDF)
 

Optimizing OpenCL for Intel FPGAs (2 days)
This instructor-led training focuses on writing kernel functions that are optimized for Intel FPGAs, including hands-on exercises.
 

OpenCL Training Courses

Introduction to FPGA Acceleration for Software Programmers Using OpenCL
This training describes ways that you can use OpenCL to target an FPGA to create custom accelerated systems with an average of one fifth the power of competing accelerators, trends that are making FPGAs an important resource for accelerating software execution, and how OpenCL makes them accessible to software developers.
 

FPGA vs GPGPU (21 minutes)
Watch this short video to learn how FPGAs provide power-efficient acceleration with far fewer restrictions and far more flexibility than GPGPUs. We will compare and contrast the approach to solving problems by leveraging this flexibility with the fixed architecture of the GPGPU.
 

OpenCL on Intel SoC FPGA (Linux Host)
Part 1 – Tools Download and Setup (5 minutes)
Part 2 – Running the Vector Add Example with the Emulator (4 minutes)
Part 3 – Kernel and Host Code Compilation for SoC FPGA (4 minutes)
Part 4 – Setup of the Runtime Environment (7 minutes)

These training courses show you how to get started with OpenCL on an SoC in a Linux* environment.
 

Introduction to Parallel Computing with OpenCL (30 minutes)
Get an overview of the OpenCL standard and the advantages of using Intel's OpenCL solution.
 

Writing OpenCL Programs for Intel FPGAs (1 hour)
Understand the basics of the OpenCL standard and learn to write simple programs.
 

Running OpenCL on Intel FPGAs (30 minutes)
Get to know the Intel FPGA SDK for OpenCL and learn to compile and run OpenCL programs on Intel FPGAs.
 

Building Custom Platforms for Intel FPGA SDK for OpenCL (1 hour)
Learn how to create a custom board support package for use with your board and the Intel FPGA SDK for OpenCL.
 

Introduction to OpenCL for Intel FPGAs (1 day)
Get an overview of parallel computing, the OpenCL standard, and the OpenCL for FPGA design flow in this instructor-led training. The focus of the training is not on writing kernels, but rather going over the FPGA-specific portion of creating an OpenCL environment for hardware acceleration.

¹ Product is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Intel and Quartus are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.