3.1.2. Execution Order for Channels and Pipes

Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176

Date 9/24/2018

Version 18.1

Public

Each channel or pipe call in a kernel program translates into an instruction that executes in the FPGA pipeline. A channel or pipe call executes when a valid work-item passes through the pipeline. However, even when there is no control or data dependence between channel or pipe calls, their execution might not achieve perfect instruction-level parallelism in the kernel pipeline.

Consider the following code examples:

Table 6. Kernel with Two Read Channel or Pipe Calls

Kernel with Two Read Channel Calls

#pragma OPENCL EXTENSION cl_intel_channels : enable

// File-scope channel declarations; not shown in the original example.
channel uint c0;
channel uint c1;

__kernel void
consumer (__global uint*restrict dst) {
  for (int i = 0; i < 5; i++) {
    dst[2*i] = read_channel_intel(c0);
    dst[2*i+2] = read_channel_intel(c1);
  }
}

Kernel with Two Read Pipe Calls

__kernel void
consumer (__global uint*restrict dst,
  read_only pipe uint
    __attribute__((blocking)) c0,
  read_only pipe uint
    __attribute__((blocking)) c1)
{
  for (int i = 0; i < 5; i++) {
    read_pipe (c0, &dst[2*i]);
    read_pipe (c1, &dst[2*i+2]);
  }
}

The first code example makes two read channel calls. The second code example makes two read pipe calls. In most cases, the kernel executes these channel or pipe calls in parallel; however, the calls might execute out of sequence. Out-of-sequence execution means that the read operation from c1 can occur and complete before the read operation from c0.
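To make the consequence of out-of-sequence execution concrete, the following is a minimal host-side sketch in plain C, not kernel code and not a model of the compiler's actual scheduling. It assumes each channel or pipe can be approximated by a simple FIFO, and the names fifo_t, fifo_read, consumer_model, and orders_agree are hypothetical helpers introduced only for illustration. The loop keeps the same indices as the kernels above and runs the two reads in either completion order. Because the two reads in an iteration touch different FIFOs and different dst elements, both orders should leave dst with the same contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ITERATIONS 5
/* Highest index written is 2*(ITERATIONS-1)+2, so dst needs one more element. */
#define DST_LEN (2 * (ITERATIONS - 1) + 2 + 1)

/* Hypothetical FIFO standing in for a channel or pipe (an assumption). */
typedef struct {
    const unsigned *data;
    size_t len;
    size_t next;
} fifo_t;

/* Pops the next value; a real blocking read would stall instead of asserting. */
static unsigned fifo_read(fifo_t *f)
{
    assert(f->next < f->len);
    return f->data[f->next++];
}

/* Software model of the consumer loop. A nonzero c1_first makes the
 * read from c1 complete before the read from c0 in every iteration. */
void consumer_model(unsigned *dst, fifo_t *c0, fifo_t *c1, int c1_first)
{
    for (int i = 0; i < ITERATIONS; i++) {
        if (c1_first) {
            dst[2 * i + 2] = fifo_read(c1);
            dst[2 * i]     = fifo_read(c0);
        } else {
            dst[2 * i]     = fifo_read(c0);
            dst[2 * i + 2] = fifo_read(c1);
        }
    }
}

/* Returns 1 when both completion orders leave dst identical. */
int orders_agree(void)
{
    static const unsigned a[ITERATIONS] = {10, 11, 12, 13, 14};
    static const unsigned b[ITERATIONS] = {20, 21, 22, 23, 24};
    unsigned in_order[DST_LEN] = {0};
    unsigned swapped[DST_LEN] = {0};

    fifo_t c0 = {a, ITERATIONS, 0}, c1 = {b, ITERATIONS, 0};
    consumer_model(in_order, &c0, &c1, 0);

    fifo_t d0 = {a, ITERATIONS, 0}, d1 = {b, ITERATIONS, 0};
    consumer_model(swapped, &d0, &d1, 1);

    return memcmp(in_order, swapped, sizeof in_order) == 0;
}
```

Calling orders_agree() from a small test program compares the two orderings. The sketch only covers the case described above, where the two reads have no control or data dependence; it says nothing about designs in which the data arriving on c1 depends on c0 having been read first.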
