This benchmark demonstrates an OpenCL™ implementation of a 1D fast Fourier transform (1D FFT) on Intel® FPGAs. The benchmark can process up to 16 million complex single-precision floating point values and supports dynamically changing data size.

The algorithm used to process such large data sets has six stages. For example, assume we want to process 1 million points:

- Treating 1M points as 1K x 1K matrix, read it from external memory and transpose it on the fly.
- Run 1K 1D FFT on all the rows (of transposed matrix).
- Multiply resulting values by adjustment twiddle factors.
- Transpose the matrix and write to temporary buffer in external memory.
- Run 1K 1D FFT on all the rows.
- Transpose the matrix and write output to external memory.

The whole system consists of three kernels connected by channels. The set of three kernels is enqueued twice by the host to do the full computation. First enqueue performs steps 1-4 above, the second enqueue does steps 5-6. This is essentially a 2D FFT core with extra transposition and a twiddle multiplication.

The code is easily parameterized to support different FFT sizes as well as different performance requirements.

### FFT Performance

The performance of the core depends on number of points processed in parallel, data layout used, and number and speed of external memory. Measurements below were done on BittWare S5-PCIe-HQ D8 with two DDR3-1600s. Measurements were done on 1M point FFT for 8 points in parallel and 4M FFT for 4 points in parallel.