1.3. Performance

AN 870: Stencil Computation Reference Design

ID 683051

Date 10/10/2018

The performance³ of the reference design on an Intel® Arria® 10 GX FPGA Development Kit (A10GX1150) was compared against experimental results reported in the academic paper "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology"⁴, published in May 2017. The data in the paper was collected by running 15360 sweeps of the stencil pattern, which was optimized for the following GPUs and CPUs:

GPUs

NVIDIA* Tesla* C2075 companion processor (C2075)
NVIDIA* GeForce* GTX 760 graphics card (GTX760)
NVIDIA* GeForce* GTX 960 graphics card (GTX960)

CPUs

Intel® Xeon® Processor E5-1650 V3 (E5-1650 V3)
Intel® Core™ i7-4960X Processor Extreme Edition (i7-4960X)

Thirty kernels were chained together in a feed-forward arrangement to perform 30 iterations of the stencil algorithm in parallel. Each kernel began execution as soon as it had received enough data from the previous kernel.
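The feed-forward chaining described above can be illustrated in software with chained Python generators. This is only a sketch under stated assumptions: the 4-point averaging stencil, the fixed boundary handling, and the names sweep_stage and chain_stages are hypothetical and are not taken from the reference design, whose kernels are hardware pipelines rather than Python code. The point it shows is that each stage needs only a three-row window, so it can start emitting rows long before the previous stage has finished.

```python
from collections import deque


def sweep_stage(rows):
    """One stencil stage that consumes rows lazily and emits rows as soon as it can.

    Assumption: a 4-point average of the up/down/left/right neighbours with
    fixed boundary cells, chosen for illustration only.
    """
    window = deque(maxlen=3)  # the only rows this stage needs to hold
    for row in rows:
        window.append(row)
        if len(window) == 1:
            yield list(row)  # top boundary row passes through unchanged
        elif len(window) == 3:
            up, mid, down = window
            out = list(mid)  # left and right boundary cells stay fixed
            for j in range(1, len(mid) - 1):
                out[j] = 0.25 * (up[j] + down[j] + mid[j - 1] + mid[j + 1])
            yield out
    if len(window) >= 2:
        yield list(window[-1])  # bottom boundary row passes through unchanged


def chain_stages(rows, n_stages=30):
    """Feed the output stream of each stage directly into the next stage."""
    stream = iter(rows)
    for _ in range(n_stages):
        stream = sweep_stage(stream)
    return stream
```

Calling list(chain_stages(grid, 30)) on a list of equal-length rows yields the same values as 30 whole-grid sweeps applied one after another, while each stage buffers only three rows at a time.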

With the appropriate changes, the kernel can alternate between the two global memory objects, reading from one and writing to the other, so that the 30 sweeps can be repeated as many times as needed. The execution time of each 30-sweep pass in this case is the same as reported for the non-repeating case, and might be lower if other optimizations are applied. The following table lists the total execution time for all 15360 sweeps on a 32768x4096 float data set. Lower execution times are better.
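The alternation between two global memory objects is commonly called ping-pong buffering. A minimal host-side sketch follows. It assumes a stand-in chained_pass function that performs 30 sweeps of a simple 1-D 3-point average; both the function names and the 1-D stencil are hypothetical and serve only to keep the example short.

```python
def chained_pass(src, dst, sweeps=30):
    """Stand-in for one pass through the kernel chain: reads src, writes dst.

    Assumption: a 1-D 3-point average with fixed end points, used only to
    keep the sketch short; it is not the reference design's stencil.
    """
    cur = list(src)
    for _ in range(sweeps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    dst[:] = cur


def run_sweeps(buf_a, buf_b, total_sweeps, sweeps_per_pass=30):
    """Run total_sweeps by alternating which buffer is read and which is written."""
    if total_sweeps % sweeps_per_pass != 0:
        raise ValueError("total_sweeps must be a multiple of sweeps_per_pass")
    src, dst = buf_a, buf_b
    for _ in range(total_sweeps // sweeps_per_pass):
        chained_pass(src, dst, sweeps_per_pass)
        src, dst = dst, src  # swap roles: the output becomes the next input
    return src  # after the final swap, src names the buffer written last
```

Because only the roles of the two buffers are swapped, no data is copied between passes, which is consistent with the per-pass execution time staying the same as in the non-repeating case.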

Table 1. Execution Time Comparison
Processor	Execution Time (s)
A10GX1150	42.455
C2075	232.4
GTX760	176.5
GTX960	111.7
E5-1650 V3	258.9
i7-4960X	260.2
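
To make the comparison in Table 1 easier to read, the following sketch computes how many times longer each processor takes than the A10GX1150. The execution times are copied from Table 1; the dictionary and function names are illustrative only.

```python
# Execution times in seconds, copied from Table 1.
EXEC_TIME_S = {
    "A10GX1150": 42.455,
    "C2075": 232.4,
    "GTX760": 176.5,
    "GTX960": 111.7,
    "E5-1650 V3": 258.9,
    "i7-4960X": 260.2,
}


def slowdown_vs(baseline, times=EXEC_TIME_S):
    """Return each processor's execution time divided by the baseline's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


if __name__ == "__main__":
    ratios = slowdown_vs("A10GX1150")
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ratio:.2f}x the A10GX1150 execution time")
```

With these figures, the ratios range from about 2.6x for the GTX960 to about 6.1x for the i7-4960X.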

The execution times shown in the following chart were recorded using the stencil computation reference design running on an Intel® Arria® 10 GX FPGA Development Kit board. Applying the optimizations listed below could improve execution time significantly. The single largest performance boost would come from moving to a larger device, such as an Intel® Stratix® 10 device.

Compiling with a newer version of Intel® Quartus® Prime Design Suite
Fitting more calculation kernels in the chain
Using an FPGA device that is larger, faster, or both
Removing profiling hardware

The following table shows the complete chart of execution times on an Intel® Arria® 10 GX FPGA Development Kit for three data set sizes:

No. of Kernels	Data set 4088x65536			Data set 4088x32760			Data set 4088x4088
No. of Kernels	Exec. Time (ms)	GFLOPS	Stall % (Worst)	Exec. Time (ms)	GFLOPS	Stall % (Worst)	Exec. Time (ms)	GFLOPS	Stall % (Worst)
1	151.34	7.081040518	29.9	75.75	7.0718352	29.25	9.71	6.8843	34.41
2	148.89	14.39511951	34.74	74.56	14.369408	34.67	9.62	13.898	38.03
3	150.16	21.41005605	32.45	75.1	21.399129	32.44	9.2	21.798	30.05
10	150.33	71.28614861	35.03	75.23	71.207167	34.99	9.51	70.291	35.23
20	164.24	130.4974028	19.78	82.12	130.46554	20.09	10.38	128.8	33.22
28	165.77	181.0101394	15.41	82.88	180.97686	15.13	10.45	179.11	22.77
29	162.62	191.1062322	22.53	81.4	190.84833	22.78	10.42	186.04	15.94
30	165.8	193.9043435	17.46	82.92	193.81025	17.35	10.43	192.27	12.03
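
The GFLOPS column can be reproduced from the other columns. The sketch below assumes 4 floating-point operations per cell per sweep; that factor is not stated in the text and is inferred here because it matches the tabulated values. The function name and parameters are illustrative.

```python
FLOPS_PER_CELL = 4  # assumption: inferred from the table, not stated in the text


def gflops(width, height, n_kernels, exec_time_ms):
    """Throughput in GFLOPS for n_kernels chained sweeps over a width x height grid."""
    total_flops = FLOPS_PER_CELL * width * height * n_kernels
    seconds = exec_time_ms * 1e-3
    return total_flops / seconds / 1e9


if __name__ == "__main__":
    # 30 kernels on the 4088x65536 data set, execution time taken from the table.
    print(f"{gflops(4088, 65536, 30, 165.8):.4f} GFLOPS")
```

Under this assumption, execution time stays roughly flat as kernels are added, so throughput scales almost linearly with the number of chained kernels.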

The following heat maps show the convergence of values after running the kernel. The first pair of images represents the unprocessed raw input. The heat maps illustrate 280x280 grids, where the values in the left maps were initially created with a pseudorandom function, and the values in the right maps were created by looping from 1 to 100.

To estimate the execution time for a system that requires more than 30 iterations, divide the desired number of iterations by 30 and multiply the result by the time that 30 iterations take to run ( $(\text{desired number of iterations} \div 30) \times (\text{time for 30 iterations to run})$ ). For example, 300 iterations applied to a 4088x65536 data set should take around 1658 ms to run.
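
The scaling rule above can be written as a short helper. The function and parameter names are illustrative, and the sketch assumes, as the rule does, that every 30-sweep pass takes the same time.

```python
def estimate_time_ms(iterations, time_for_30_ms, chain_length=30):
    """Estimate total run time by scaling the measured time of one 30-sweep pass."""
    return (iterations / chain_length) * time_for_30_ms


if __name__ == "__main__":
    # 300 iterations on the 4088x65536 data set (165.8 ms for 30 kernels).
    print(estimate_time_ms(300, 165.8))
    # 15360 sweeps on the 4088x32760 data set (82.92 ms for 30 kernels).
    print(estimate_time_ms(15360, 82.92) / 1000.0, "s")
```

The second call gives roughly 42.455 s, which agrees with the A10GX1150 entry in Table 1. This suggests, although the text does not state it, that the Table 1 figure was obtained by the same scaling.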

The following heat map shows the result of running the initial values through a kernel with 30 chained calculation CUs. You can see the beginnings of convergence emerge.

³ Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

⁴ H. M. Waidyasooriya, Y. Takei, S. Tatsumi and M. Hariyama, "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology," in IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1390-1402, May 1 2017.

doi: 10.1109/TPDS.2016.2614981

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582502&isnumber=7894348
