AN 870: Stencil Computation Reference Design
In all versions of a stencil code, a specific algorithm is applied to every element in a matrix in a sweep. Newly calculated values are piped forward into the next sweep. Cells within the same sweep have no dependencies and are calculated in parallel. Sweeps can also be calculated in parallel if the results from one sweep are continuously piped into the next.
 Large internal and external bandwidth
 A highly pipelined and connected fabric that exploits the massive parallelism inherent in a stencil code to achieve minimal execution times
Four-Point 2D Stencil / 2D Jacobi Iteration Algorithm
In the 1D representation of a 2D matrix, assuming the length of a row is N and the number of rows is M, the 4-point neighborhood/stencil of any individual point x is given by: {x-N, x+N, x-1, x+1}, where all values outside of the boundary of the matrix are assumed to be 0. Assume we have matrix A with values {x_{0}, x_{1}, x_{2}, x_{3}, …}, where the index i is in the range 0 to N×M - 1. The matrix initially begins at timestep T_{0}. The values at T_{1} are defined as: T_{1}(x_{i}) = 0.25 * (T_{0}(x_{i+1}) + T_{0}(x_{i-1}) + T_{0}(x_{i+N}) + T_{0}(x_{i-N})).
Now assume we have a 2D representation of a 2D matrix B with values {x_{0}y_{0}, x_{1}y_{0}, x_{0}y_{1}, x_{1}y_{1}, …}, where x is in the range 0 to N and y is in the range 0 to M. The matrix initially begins at timestep T_{0}. The values at T_{1} are defined as follows:
T_{1}(x_{i}y_{j}) = 0.25 * (T_{0}(x_{i}y_{j-1}) + T_{0}(x_{i}y_{j+1}) + T_{0}(x_{i-1}y_{j}) + T_{0}(x_{i+1}y_{j}))
The main idea behind this specific stencil pattern is to iteratively update the data within a matrix until the values converge (Jacobi iteration). Performance is optimized by using caching and feed-forward techniques to reduce reads from global memory (DDR) as much as possible, ideally to 1. Data is typically retrieved from real-world measurements, but for the sake of testing, all data in the associated reference design is pseudorandom.
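The four-point update rule above can be sketched in plain Python. This is an illustration of the math only, not the OpenCL implementation; the function name is hypothetical.

```python
# One four-point Jacobi sweep over a 2D matrix stored as a list of rows.
# Neighbors outside the matrix boundary contribute 0, matching the boundary
# assumption described above.

def jacobi_sweep(grid):
    rows, cols = len(grid), len(grid[0])

    def at(i, j):
        # Out-of-bounds neighbors are treated as 0.
        return grid[i][j] if 0 <= i < rows and 0 <= j < cols else 0.0

    return [[0.25 * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1))
             for j in range(cols)]
            for i in range(rows)]
```

For example, a 2x2 matrix of all 4.0s becomes all 2.0s after one sweep, because each cell averages two in-bounds neighbors and two zero boundary values.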
OpenCL Design
Stencil computations are, in general, memory-bound applications. The optimizations included in this OpenCL* reference design seek to leverage the power of an FPGA by both parallelizing as much as possible and using channels/pipes to saturate the bandwidth from onboard DDR, maximizing GFLOPS and minimizing execution time.
Data is initially read from global memory by the compute unit (CU) named "feeder" and written back to global memory by the "writer" CU. Data is read and written in blocks of 16 floats at a time, and the net data bandwidth depends on the frequency of the kernel and the supported data read/write rate of the onboard DDR and memory controller.
 A source location where data from the host is sent via PCIe to be calculated
 An output location where data is read by the host after kernel execution
The following diagram captures the flow of data into and out of the kernel system:
Between the feeder and writer CUs is the main part of the kernel system – a series of chained and replicated calculation CUs. Data flows into the first calculation CU in the same order it was read from DDR, and the calculated data is piped into the next CU in the chain in the same order. Data coming into a CU is cached, boundaries are updated, and the result of each calculation is immediately sent into the next kernel. In this way every single CU, and hence every iteration, can be calculated in parallel once enough data has been sent through the chain.
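The feed-forward chain can be modeled in plain Python with generators, where each stage stands in for one calculation CU: it caches three consecutive rows and emits each result row as soon as the row below it arrives, so the next stage can begin before the current sweep finishes. This is a sketch of the dataflow only, not the OpenCL channel implementation; all names are hypothetical.

```python
def update_row(top, mid, bot):
    # Four-point update of one row; out-of-matrix neighbors contribute 0.
    w = len(mid)
    return [0.25 * (top[j] + bot[j]
                    + (mid[j - 1] if j > 0 else 0.0)
                    + (mid[j + 1] if j < w - 1 else 0.0))
            for j in range(w)]

def stage(rows_in, width):
    """One calculation CU: cache three consecutive rows and emit the updated
    middle row as soon as its bottom neighbor arrives."""
    zero = [0.0] * width
    prev2, prev1 = None, None
    for row in rows_in:
        if prev1 is not None:
            yield update_row(prev2 if prev2 is not None else zero, prev1, row)
        prev2, prev1 = prev1, row
    if prev1 is not None:
        # Final row: its bottom neighbor lies outside the matrix.
        yield update_row(prev2 if prev2 is not None else zero, prev1, zero)

def run_chain(rows, width, num_cus):
    # Chain num_cus stages; one pass through the chain performs num_cus sweeps.
    stream = iter(rows)
    for _ in range(num_cus):
        stream = stage(stream, width)
    return list(stream)
```

Chaining two stages produces the same result as applying one sweep twice, which is the invariant the hardware chain relies on.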
Within each of these calculation CUs is a local memory system called "cache" that is composed of M20K memory blocks adjacent to the CUs. The cache must be large enough to store an entire row of the incoming matrix. The height of the matrix can be as large as device memory allows. Matrix attributes are fed forward through the system before stencil computations begin.
Because of the nature of the computation, only 14 of the 16 floating-point values being forwarded can be calculated at a time, so boundary conditions are updated between CUs. This is because each element requires both its left and right neighbors to perform a stencil computation, so the first and last elements in the block cannot be calculated. The entire computation requires a total of 3 blocks of 16 floats to be loaded into private registers: the row segment being calculated and its top and bottom neighbors. After a calculation is performed, one block with 14 new and 2 outdated values is piped to the next CU.
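One block computation under these constraints can be sketched as follows (Python for illustration; the real CUs operate on hardware registers, and all names here are hypothetical):

```python
BLOCK = 16  # floats per block, as read from DDR

def compute_block(top, mid, bot):
    """Update one 16-float block given its top- and bottom-neighbor blocks.
    Lanes 0 and 15 lack an in-block left/right neighbor, so they are
    forwarded unchanged (outdated) and fixed up between CUs."""
    out = list(mid)  # lanes 0 and 15 keep their old values
    for j in range(1, BLOCK - 1):
        out[j] = 0.25 * (top[j] + bot[j] + mid[j - 1] + mid[j + 1])
    return out
```

For a block of 8.0s with all-zero neighbor blocks, the 14 interior lanes each become 0.25 × (0 + 0 + 8 + 8) = 4.0, while lanes 0 and 15 retain their old value of 8.0.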
Performance
 GPUs

 NVIDIA* Tesla* C2075 companion processor (C2075)
 NVIDIA* GeForce* GTX 760 graphics card (GTX760)
 NVIDIA* GeForce* GTX 960 graphics card (GTX960)
 CPUs

 Intel^{®} Xeon^{®} Processor E5-1650 v3 (E5-1650 v3)
 Intel^{®} Core^{®} i7-4960X Processor Extreme Edition (i7-4960X)
Thirty kernels were chained together in a feed-forward approach to perform 30 iterations of the stencil algorithm in parallel. Each individual kernel began execution as soon as it was sent enough information from the previous kernel.
Processor     Execution Time (s)
A10GX1150     42.455
C2075         232.4
GTX760        176.5
GTX960        111.7
E5-1650 v3    258.9
i7-4960X      260.2
 Compiling with a newer version of Intel^{®} Quartus^{®} Prime Design Suite
 Fitting more calculation kernels in the chain
 Using an FPGA device that is larger, faster, or both
 Removing profiling hardware
No. of Kernels  Data set 4088x65536                       Data set 4088x32760                       Data set 4088x4088
                Exec. Time (ms)  GFLOPS  Stall % (Worst)  Exec. Time (ms)  GFLOPS  Stall % (Worst)  Exec. Time (ms)  GFLOPS  Stall % (Worst)
1               151.34           7.08    29.90            75.75            7.07    29.25            9.71             6.88    34.41
2               148.89           14.40   34.74            74.56            14.37   34.67            9.62             13.90   38.03
3               150.16           21.41   32.45            75.10            21.40   32.44            9.20             21.80   30.05
10              150.33           71.29   35.03            75.23            71.21   34.99            9.51             70.29   35.23
20              164.24           130.50  19.78            82.12            130.47  20.09            10.38            128.80  33.22
28              165.77           181.01  15.41            82.88            180.98  15.13            10.45            179.11  22.77
29              162.62           191.11  22.53            81.40            190.85  22.78            10.42            186.04  15.94
30              165.80           193.90  17.46            82.92            193.81  17.35            10.43            192.27  12.03
The following heat maps show the convergence of values after running the kernel. The first pair of images represents the unprocessed raw input. The heat maps illustrate 280x280 grids, where the values in the left maps were initially created with a pseudorandom function, and the values in the right maps were created by looping from 1 to 100.
To estimate the execution time for a system requiring more than 30 iterations, multiply the time it takes 30 iterations to run by the desired number of iterations divided by 30 (t = (iterations / 30) × t_30). For example, 300 iterations applied to a 4088x65536 data set should take around (300 / 30) × 165.8 ms ≈ 1658 ms to run.
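That arithmetic can be checked with a small helper (hypothetical; t_30 = 165.8 ms is the measured time for 30 kernels on the 4088x65536 data set from the table above):

```python
def estimated_time_ms(iterations, t_30_ms=165.8, chain_length=30):
    # Runtime scales linearly with the number of 30-iteration passes
    # through the kernel chain.
    return (iterations / chain_length) * t_30_ms
```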
The following heat map shows the result of running the initial values through a kernel with 30 chained calculation CUs. You can see the beginnings of convergence emerge.
Document Revision History for AN 870: Stencil Computation Reference Design
Document Version    Changes
2018.10.10          Initial release.