AN 870: Stencil Computation Reference Design

ID 683051
Date 10/10/2018

1.3. Performance

The performance [3] of the reference design on an Intel® Arria® 10 GX FPGA Development Kit (A10GX1150) was compared against experimental results reported in the academic paper "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology" [4], published in May 2017. The data in the paper was collected by running 15360 sweeps of the stencil pattern. The pattern was optimized for the following GPUs and CPUs:
  • NVIDIA* Tesla* C2075 companion processor (C2075)
  • NVIDIA* GeForce* GTX 760 graphics card (GTX760)
  • NVIDIA* GeForce* GTX 960 graphics card (GTX960)
  • Intel® Xeon® Processor E5-1650 V3 (E5-1650 V3)
  • Intel® Core® i7-4960X Processor Extreme Edition (i7-4960X)

Thirty kernels were chained together in a feed-forward approach in order to perform 30 iterations of the stencil algorithm in parallel. Each individual kernel began execution as soon as it was sent enough information from the previous kernel.
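
The feed-forward chaining described above can be modeled in software with chained generators. The sketch below is a plain Python model, not the reference design's OpenCL code. The 4-point averaging stencil, the pass-through handling of boundary cells, and the names `relax_row`, `stencil_stage`, `chain`, and `sweep` are all assumptions made for illustration. Each stage emits an output row as soon as it has received the input rows that row depends on, so a downstream stage starts working before the upstream stage has finished.

```python
def relax_row(up, mid, down):
    """Apply an assumed 4-point averaging stencil to one interior row."""
    out = list(mid)  # left and right boundary cells pass through unchanged
    for x in range(1, len(mid) - 1):
        out[x] = 0.25 * (up[x] + down[x] + mid[x - 1] + mid[x + 1])
    return out


def stencil_stage(rows):
    """One stage of the chain: consume rows in order and emit each output
    row as soon as the rows it depends on have arrived."""
    prev = cur = None
    for nxt in rows:
        if cur is None:
            cur = nxt              # first row received: nothing to emit yet
            continue
        if prev is None:
            yield list(cur)        # top boundary row passes through
        else:
            yield relax_row(prev, cur, nxt)
        prev, cur = cur, nxt
    if cur is not None:
        yield list(cur)            # bottom boundary row passes through


def chain(grid, stages):
    """Feed the grid through `stages` chained stencil stages."""
    stream = iter(grid)
    for _ in range(stages):
        stream = stencil_stage(stream)
    return list(stream)


def sweep(grid):
    """Reference: one whole-grid sweep, used only to check the chain."""
    out = [list(row) for row in grid]
    for y in range(1, len(grid) - 1):
        for x in range(1, len(grid[y]) - 1):
            out[y][x] = 0.25 * (grid[y - 1][x] + grid[y + 1][x]
                                + grid[y][x - 1] + grid[y][x + 1])
    return out
```

Calling `chain(grid, 30)` gives the same result as applying `sweep` 30 times in sequence, which is the property the chained kernels rely on.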

If optimized correctly, the kernel can easily be altered to alternate between reading from and writing to the two global memory objects, so that the 30 sweeps can be repeated as many times as needed. The execution time of each 30-sweep pass in this case is the same as reported for the non-repeating case, and might be lower if other optimizations are applied. The following table lists the execution time for all 15360 sweeps on a 32768x4096 float data set. Lower execution times are better.
Table 1.  Execution Time Comparison
Processor     Execution Time (s)
A10GX1150     42.455
C2075         232.4
GTX760        176.5
GTX960        111.7
E5-1650 V3    258.9
i7-4960X      260.2
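
The alternating read/write scheme described before the table can be sketched as a host-side loop over two buffers. This is a minimal Python model, not OpenCL host code. The function `run_sweeps` and the callback `one_pass(src, dst)`, which stands in for one run of the 30-kernel chain, are hypothetical names introduced here.

```python
SWEEPS_PER_PASS = 30  # one pass through the 30-kernel chain


def run_sweeps(initial, total_sweeps, one_pass):
    """Run total_sweeps sweeps by repeating the 30-sweep pass and swapping
    the roles of two buffers (read source, write destination) each pass."""
    passes, remainder = divmod(total_sweeps, SWEEPS_PER_PASS)
    if remainder:
        raise ValueError("total_sweeps must be a multiple of 30")
    buffers = [list(initial), [0.0] * len(initial)]
    src = 0
    for _ in range(passes):
        dst = 1 - src
        one_pass(buffers[src], buffers[dst])  # read from src, write to dst
        src = dst                             # swap roles for the next pass
    return buffers[src], passes
```

Under this model, the 15360 sweeps used for the comparison correspond to 512 passes through the chain.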
The FPGA execution times in this section were recorded using the stencil computation reference design running on an Intel® Arria® 10 GX FPGA Development Kit board. The single largest performance boost would come from upgrading to a larger device, such as an Intel® Stratix® 10 device. Execution time can be improved significantly by applying the following optimizations:
  • Compiling with a newer version of Intel® Quartus® Prime Design Suite
  • Fitting more calculation kernels in the chain
  • Using an FPGA device that is larger, faster, or both
  • Removing profiling hardware
The following table shows the complete set of execution times on an Intel® Arria® 10 GX FPGA Development Kit for three data set sizes:
No. of    Data set 4088x65536                            Data set 4088x32760                            Data set 4088x4088
Kernels   Exec. Time (ms)  GFLOPS       Stall % (Worst)  Exec. Time (ms)  GFLOPS       Stall % (Worst)  Exec. Time (ms)  GFLOPS       Stall % (Worst)
1         151.34           7.081040518  29.9             75.75            7.0718352    29.25            9.71             6.8843       34.41
2         148.89           14.39511951  34.74            74.56            14.369408    34.67            9.62             13.898       38.03
3         150.16           21.41005605  32.45            75.1             21.399129    32.44            9.2              21.798       30.05
10        150.33           71.28614861  35.03            75.23            71.207167    34.99            9.51             70.291       35.23
20        164.24           130.4974028  19.78            82.12            130.46554    20.09            10.38            128.8        33.22
28        165.77           181.0101394  15.41            82.88            180.97686    15.13            10.45            179.11       22.77
29        162.62           191.1062322  22.53            81.4             190.84833    22.78            10.42            186.04       15.94
30        165.8            193.9043435  17.46            82.92            193.81025    17.35            10.43            192.27       12.03
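
The GFLOPS column can be related to the execution times with a short calculation. The tabulated values appear consistent with counting 4 floating-point operations per cell update per kernel. That figure is inferred from the numbers in the table and is not stated in this section, so treat `FLOPS_PER_CELL` below as an assumption. The function name `gflops` is also illustrative.

```python
FLOPS_PER_CELL = 4  # assumption inferred from the table, not stated in the text


def gflops(rows, cols, kernels, exec_time_ms):
    """Estimate throughput: every kernel in the chain updates every cell once."""
    total_flops = rows * cols * kernels * FLOPS_PER_CELL
    seconds = exec_time_ms * 1e-3
    return total_flops / seconds / 1e9
```

For example, `gflops(4088, 65536, 30, 165.8)` evaluates to roughly 193.9, in line with the 30-kernel row. Because execution time stays nearly flat as kernels are added, throughput scales almost linearly with the number of kernels in the chain.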

The following heat maps show the convergence of values after running the kernel. The first pair of images represents the unprocessed raw input. The heat maps illustrate 280x280 grids, where the values in the left maps were initially created with a pseudorandom function, and the values in the right maps were created by looping from 1 to 100.

To calculate the execution time for a system requiring more than 30 iterations, divide the desired number of iterations by 30 and multiply the result by the time it takes 30 iterations to run: estimated time = (iterations / 30) × (time for 30 iterations). For example, 300 iterations applied to a 4088x65536 data set should take around (300 / 30) × 165.8 ms = 1658 ms to run.
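
The estimate above can be written as a small helper. This is a sketch of the arithmetic only. The name `estimated_time_ms` is illustrative, and the model assumes each 30-iteration pass takes the same time with no overhead between passes.

```python
CHAIN_LENGTH = 30  # iterations performed by one pass through the kernel chain


def estimated_time_ms(iterations, time_30_ms):
    """Scale the measured 30-iteration time to the requested iteration count."""
    return (iterations / CHAIN_LENGTH) * time_30_ms
```

With the measured 165.8 ms for the 4088x65536 data set, `estimated_time_ms(300, 165.8)` gives about 1658 ms. Applied to 15360 sweeps with the 82.92 ms figure for the 4088x32760 data set, the same formula gives about 42455 ms, which appears to match the 42.455 s A10GX1150 entry in Table 1. That correspondence is an observation from the numbers rather than a statement in the source.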

The following heat map shows the result of running the initial values through a kernel with 30 chained calculation compute units (CUs). The beginnings of convergence are visible.

3 Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
4 H. M. Waidyasooriya, Y. Takei, S. Tatsumi and M. Hariyama, "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology," in IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1390-1402, May 1 2017. doi: 10.1109/TPDS.2016.2614981