4. Profiling Your Kernel to Identify Performance Bottlenecks

Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176

Date 9/24/2018

Version 18.1

Public

The Profiler generates data that helps you assess OpenCL™ kernel performance. It instruments the kernel pipeline with performance counters. These counters collect kernel performance data, which you can review in the Profiler GUI.

Consider the following OpenCL kernel program:

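__kernel void add (__global int * a,
                   __global int * b,
                   __global int * c)
{
    // Each work-item processes the element at its own global ID.
    int gid = get_global_id(0);
    // Two global memory loads (a, b) and one global memory store (c).
    c[gid] = a[gid] + b[gid];
}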
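For readers less familiar with the NDRange model, the kernel above can be read as the body of a loop over global IDs. The following is a minimal host-side sketch in plain C, not part of the SDK; the function name add_reference and the parameter n are assumptions introduced for illustration. It makes explicit that each work-item performs two loads and one store, which are the kinds of memory operations the Profiler reports on.

```c
#include <stddef.h>

/* Hypothetical reference model of the add kernel (illustrative only).
 * Loop iteration gid mirrors the work-item whose global ID is gid. */
void add_reference(const int *a, const int *b, int *c, size_t n)
{
    for (size_t gid = 0; gid < n; gid++) {
        /* Two loads (a[gid], b[gid]) and one store (c[gid]) per work-item. */
        c[gid] = a[gid] + b[gid];
    }
}
```

Calling add_reference(a, b, c, n) with three arrays of n elements produces the same results that n work-items of the kernel would, which can be useful when checking kernel output on the host.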

As shown in the figure below, the Profiler instruments and connects performance counters in a daisy chain throughout the pipeline generated for the kernel program. The host then reads the data collected by these counters. For example, in PCI Express® (PCIe®)-based systems, the host reads the data via the PCIe control register access (CRA) or control and status register (CSR) port.

Figure 62. Performance Counters Instrumentation

Work-item execution stalls might occur at various stages of a kernel pipeline. Applications that perform many memory accesses (load and store operations) might stall frequently while memory transfers complete. The Profiler helps identify the load and store operations or channel accesses that cause the majority of stalls within a kernel pipeline.
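As a rough illustration of how per-access stall counts can point to a bottleneck, the sketch below models one counter pair per memory access site and picks the site with the most stalled cycles. This is a toy model under stated assumptions, not the Profiler's actual counter layout or algorithm: the type access_site and the functions record_cycle, stall_percent, and worst_site are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for one load, store, or channel access site. */
typedef struct {
    const char   *name;          /* label for the access site, e.g. "load a" */
    unsigned long total_cycles;  /* cycles observed at this site */
    unsigned long stall_cycles;  /* cycles in which this site stalled the pipeline */
} access_site;

/* Record one observed cycle; stalled is nonzero if the site stalled. */
void record_cycle(access_site *site, int stalled)
{
    assert(site != NULL);
    site->total_cycles++;
    if (stalled) {
        site->stall_cycles++;
    }
}

/* Share of observed cycles spent stalled, as a percentage (0 if no cycles). */
double stall_percent(const access_site *site)
{
    assert(site != NULL);
    if (site->total_cycles == 0) {
        return 0.0;
    }
    return 100.0 * (double)site->stall_cycles / (double)site->total_cycles;
}

/* Index of the site with the most stalled cycles; ties keep the earlier site.
 * Requires count >= 1. */
size_t worst_site(const access_site *sites, size_t count)
{
    assert(sites != NULL && count > 0);
    size_t worst = 0;
    for (size_t i = 1; i < count; i++) {
        if (sites[i].stall_cycles > sites[worst].stall_cycles) {
            worst = i;
        }
    }
    return worst;
}
```

For the add kernel, one might create three sites ("load a", "load b", "store c"), call record_cycle once per site per cycle, and then use worst_site to find the access to investigate first.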

For usage information on the Profiler, refer to the Profiling Your OpenCL Kernel section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.

Section Content
Best Practices
GUI
Interpreting the Profiling Information
Limitations

Related Information

Profiling Your OpenCL Kernel
