Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide
ID: 683176
Date: 9/24/2018
Public
1. Introduction to Standard Edition Best Practices Guide
2. Reviewing Your Kernel's report.html File
3. OpenCL Kernel Design Best Practices
4. Profiling Your Kernel to Identify Performance Bottlenecks
5. Strategies for Improving Single Work-Item Kernel Performance
6. Strategies for Improving NDRange Kernel Data Processing Efficiency
7. Strategies for Improving Memory Access Efficiency
8. Strategies for Optimizing FPGA Area Usage
A. Additional Information
2.1. High Level Design Report Layout
2.2. Reviewing the Report Summary
2.3. Reviewing Loop Information
2.4. Reviewing Area Information
2.5. Verifying Information on Memory Replication and Stalls
2.6. Optimizing an OpenCL Design Example Based on Information in the HTML Report
2.7. HTML Report: Area Report Messages
2.8. HTML Report: Kernel Design Concepts
3.1. Transferring Data Via Channels or OpenCL Pipes
3.2. Unrolling Loops
3.3. Optimizing Floating-Point Operations
3.4. Allocating Aligned Memory
3.5. Aligning a Struct with or without Padding
3.6. Maintaining Similar Structures for Vector Type Elements
3.7. Avoiding Pointer Aliasing
3.8. Avoid Expensive Functions
3.9. Avoiding Work-Item ID-Dependent Backward Branching
4.3.4.1. High Stall Percentage
4.3.4.2. Low Occupancy Percentage
4.3.4.3. Low Bandwidth Efficiency
4.3.4.4. High Stall and High Occupancy Percentages
4.3.4.5. No Stalls, Low Occupancy Percentage, and Low Bandwidth Efficiency
4.3.4.6. No Stalls, High Occupancy Percentage, and Low Bandwidth Efficiency
4.3.4.7. Stalling Channels
4.3.4.8. High Stall and Low Occupancy Percentages
7.1. General Guidelines on Optimizing Memory Accesses
7.2. Optimize Global Memory Accesses
7.3. Performing Kernel Computations Using Constant, Local or Private Memory
7.4. Improving Kernel Performance by Banking the Local Memory
7.5. Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor
7.6. Minimizing the Memory Dependencies for Loop Pipelining
5.1.2. Relaxing Loop-Carried Dependency
Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occurs between the generation of a loop-carried value and its usage.
Consider the following code example:
1  #define N 128
2
3  __kernel void unoptimized (__global float * restrict A,
4                             __global float * restrict result)
5  {
6    float mul = 1.0f;
7
8    for (unsigned i = 0; i < N; i++)
9      mul *= A[i];
10
11   *result = mul;
12 }
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized.cl line 8)
    Pipelined with successive iterations launched every 6 cycles due to:

        Data dependency on variable mul (file unoptimized.cl line 9)
        Largest Critical Path Contributor:
            100%: Fmul Operation (file unoptimized.cl line 9)
==================================================================================
The optimization report above shows that the offline compiler successfully infers pipelined execution for the loop. However, the loop-carried dependency on the variable mul causes loop iterations to launch only every six cycles. In this case, the floating-point multiplication operation on line 9 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable and use one copy every M iterations:
- Declare multiple copies of the variable mul (for example, in an array called mul_copies).
- Initialize all the copies of mul_copies.
- Use the last copy in the array in the multiplication operation.
- Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
- Reduce all the copies to mul and write the final value to result.
Below is the restructured kernel:
1  #define N 128
2  #define M 8
3
4  __kernel void optimized (__global float * restrict A,
5                           __global float * restrict result)
6  {
7    float mul = 1.0f;
8
9    // Step 1: Declare multiple copies of variable mul
10   float mul_copies[M];
11
12   // Step 2: Initialize all copies
13   for (unsigned i = 0; i < M; i++)
14     mul_copies[i] = 1.0f;
15
16   for (unsigned i = 0; i < N; i++) {
17     // Step 3: Perform multiplication on the last copy
18     float cur = mul_copies[M-1] * A[i];
19
20     // Step 4a: Shift copies
21     #pragma unroll
22     for (unsigned j = M-1; j > 0; j--)
23       mul_copies[j] = mul_copies[j-1];
24
25     // Step 4b: Insert updated copy at the beginning
26     mul_copies[0] = cur;
27   }
28
29   // Step 5: Perform reduction on copies
30   #pragma unroll
31   for (unsigned i = 0; i < M; i++)
32     mul *= mul_copies[i];
33
34   *result = mul;
35 }
An optimization report similar to the one below indicates the successful relaxation of the loop-carried dependency on the variable mul:
==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Fully unrolled loop (file optimized2.cl line 13)
    Loop was automatically and fully unrolled.
    Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized2.cl line 16)
| Pipelined well. Successive iterations are launched every cycle.
|
|-+ Fully unrolled loop (file optimized2.cl line 22)
      Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized2.cl line 31)
    Loop was fully unrolled due to "#pragma unroll" annotation.
==================================================================================