3.1. Transferring Data Via Channels or OpenCL Pipes

Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176

Date 9/24/2018

Version 18.1

Public

To increase data transfer efficiency between kernels, implement the channels extension in your kernel programs. If you want the capabilities of channels but also need to run your kernel program with other SDKs, implement OpenCL pipes instead.

Sometimes, FPGA-to-global memory bandwidth constrains the data transfer efficiency between kernels. The theoretical maximum FPGA-to-global memory bandwidth varies depending on the number of global memory banks available in the targeted Custom Platform and board. To determine the theoretical maximum bandwidth for your board, refer to your board vendor's documentation.

In practice, a kernel does not achieve 100% utilization of the maximum global memory bandwidth available. The level of utilization depends on the access pattern of the algorithm.
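The relationship between the theoretical maximum and the bandwidth a kernel actually sees can be sketched as simple arithmetic. The following is a minimal illustration in plain host-side C, not part of the SDK: the function names are hypothetical, the utilization fraction is something you would measure (for example, with the profiler), and GB is assumed to mean 10^9 bytes.

```c
#include <assert.h>

/* Hypothetical helper (not an SDK function): scale the peak bandwidth
 * quoted by the board vendor, in GB/s, by the fraction of that peak the
 * kernel's access pattern actually sustains (0 < utilization <= 1). */
double effective_bandwidth_gbs(double peak_gbs, double utilization)
{
    assert(peak_gbs > 0.0);
    assert(utilization > 0.0 && utilization <= 1.0);
    return peak_gbs * utilization;
}

/* Hypothetical helper: lower bound, in seconds, on the time needed to
 * move `bytes` through global memory at the given effective bandwidth.
 * Assumes 1 GB = 1e9 bytes. */
double min_transfer_seconds(double bytes, double effective_gbs)
{
    assert(effective_gbs > 0.0);
    return bytes / (effective_gbs * 1e9);
}
```

For instance, with an assumed peak of 32 GB/s and a measured utilization of 50%, `effective_bandwidth_gbs(32.0, 0.5)` gives 16 GB/s, and moving 16e9 bytes at that rate takes at least one second. Both figures are illustrative and do not describe any particular board.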

If global memory bandwidth is a performance constraint for your OpenCL kernel, first try to break down the algorithm into multiple smaller kernels. Then, as shown in the following figure, eliminate some of the global memory accesses by using the SDK's channels or OpenCL pipes to transfer data between the kernels.

Figure 59. Difference in Global Memory Access Pattern as a Result of Channels or Pipes Implementation
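As a rough illustration of the idea, the sketch below splits a computation into two kernels that exchange intermediate values over a channel instead of through a global memory buffer. It is OpenCL C for the SDK's offline compiler, not host code. The kernel names, the channel name `stage_link`, and the arithmetic are hypothetical. The pragma and the `write_channel_intel` and `read_channel_intel` calls are assumed to match the channels extension described in the Programming Guide for this release, so check them there before relying on this.

```c
// Enable the SDK's channels extension (assumed name for this release).
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channel carrying one float per transfer between the kernels.
channel float stage_link;

// Stage 1: reads the input from global memory and forwards each
// intermediate value over the channel rather than storing it in a
// global memory buffer.
__kernel void stage1(__global const float * restrict in, int n)
{
    for (int i = 0; i < n; i++) {
        write_channel_intel(stage_link, in[i] * 2.0f);
    }
}

// Stage 2: consumes the intermediate values from the channel and writes
// only the final result to global memory.
__kernel void stage2(__global float * restrict out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = read_channel_intel(stage_link) + 1.0f;
    }
}
```

In this sketch, the intermediate write and read that a global memory buffer would need disappear. Only the input read in `stage1` and the output write in `stage2` touch global memory. Because the channel calls block, the host would need to launch both kernels so that they run concurrently, for example from separate command queues.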

For more information on the usage of channels, refer to the Implementing Channels Extension section of the Standard Edition Programming Guide.

For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the Standard Edition Programming Guide.
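The same two-stage split can be sketched with OpenCL pipes, which are passed as kernel arguments rather than declared at file scope. This is a hedged OpenCL C illustration only. The kernel and argument names are hypothetical. The `blocking` attribute is assumed to be the SDK-specific pipe attribute described in the Programming Guide, and the host is assumed to create the pipe object and bind it to both kernels.

```c
// Stage 1: writes each intermediate value to the pipe's write endpoint.
// The blocking attribute (assumed SDK-specific) makes write_pipe wait
// until the pipe has room.
__kernel void stage1(__global const float * restrict in, int n,
                     write_only pipe float __attribute__((blocking)) link)
{
    for (int i = 0; i < n; i++) {
        float v = in[i] * 2.0f;
        write_pipe(link, &v);
    }
}

// Stage 2: reads each intermediate value from the pipe's read endpoint
// and writes only the final result to global memory.
__kernel void stage2(__global float * restrict out, int n,
                     read_only pipe float __attribute__((blocking)) link)
{
    for (int i = 0; i < n; i++) {
        float v;
        read_pipe(link, &v);
        out[i] = v + 1.0f;
    }
}
```

The pipe argument uses the same name in both kernels as a precaution, in case the compiler pairs the two endpoints by name. Compared with the channels version, this form uses only the standard `read_pipe` and `write_pipe` built-ins plus one attribute, which is what makes it easier to move to other SDKs.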

Section Content
Characteristics of Channels and Pipes
Execution Order for Channels and Pipes
Optimizing Buffer Inference for Channels or Pipes
Best Practices for Channels and Pipes

Related Information

Implementing Intel FPGA SDK for OpenCL Channels Extension

Implementing OpenCL Pipes
