Minimize the Memory Dependencies for Loop Pipelining

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 12/16/2022

Version

Public

A newer version of this document is available.

The Intel® oneAPI DPC++/C++ Compiler ensures that memory accesses from the same thread respect program order. When you compile an NDRange kernel, use barriers to synchronize memory accesses across threads in the same workgroup.

Loop dependencies might introduce bottlenecks for single work-item kernels because of the latency of the memory accesses. The Intel® oneAPI DPC++/C++ Compiler defers a memory operation until the memory operation it depends on completes, which can increase the loop initiation interval (II). The compiler reports the memory dependencies in the optimization report.
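As an illustration of a genuine loop-carried memory dependency, the following minimal sketch computes a running sum through memory. The function name `running_sum_in_place` is hypothetical, and the code is plain host C++ standing in for the body of a single work-item kernel; it is not taken from the guide.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): running sum computed in place.
// Iteration i loads a[i - 1], which iteration i - 1 has just stored.
// The load therefore cannot be issued before that store completes, which
// is the kind of memory dependency that can raise the loop II.
void running_sum_in_place(std::vector<int>& a) {
  for (std::size_t i = 1; i < a.size(); ++i) {
    a[i] = a[i] + a[i - 1];
  }
}
```

Because this dependency is real, the compiler has to enforce it; a loop like this is not a candidate for overriding the dependence analysis.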

To minimize the impact of memory dependencies for loop pipelining:

- Ensure that the Intel® oneAPI DPC++/C++ Compiler does not assume false dependencies.
  When the static memory dependence analysis cannot prove that a dependency does not exist, the compiler assumes that a dependency exists and modifies the kernel execution to enforce it. The impact of the dependency enforcement is lower if the memory system is stall-free.
  - Write-after-read operations with data dependency on a load-store unit can take just two clock cycles (II=2). Other stall-free scenarios can take up to seven clock cycles.
  - The Intel® oneAPI DPC++/C++ Compiler can fully resolve the read-after-write (control dependency) operation.
- Override the static memory dependence analysis by adding the line [[intel::ivdep]] before the loop in your kernel code if you are sure that the loop carries no dependencies. For more information, refer to ivdep Attribute.
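The following sketch shows where `[[intel::ivdep]]` would sit for a loop whose dependency is false but hard to disprove statically. It rests on several assumptions: the function `scatter_update` and the macro `SKETCH_IVDEP` are hypothetical names, the loop is written as plain host C++ standing in for a single work-item kernel body, and the preprocessor guard (using `SYCL_LANGUAGE_VERSION` and `__INTEL_LLVM_COMPILER`) is only one possible way to keep the code buildable with other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: apply the attribute only when compiling SYCL code with the
// Intel oneAPI DPC++/C++ Compiler. Elsewhere the macro expands to nothing,
// so the sketch still builds and runs on a host.
#if defined(SYCL_LANGUAGE_VERSION) && defined(__INTEL_LLVM_COMPILER)
#define SKETCH_IVDEP [[intel::ivdep]]
#else
#define SKETCH_IVDEP
#endif

// Hypothetical loop body of a single work-item kernel (illustration only).
// Precondition: perm is a permutation of 0..n-1, so every iteration
// updates a different element of data and no iteration depends on another.
// Static analysis cannot see that: data[perm[i]] might be an element that
// an earlier iteration wrote, so a dependency would be assumed.
void scatter_update(std::vector<int>& data,
                    const std::vector<std::size_t>& perm,
                    const std::vector<int>& in) {
  assert(perm.size() == in.size() && perm.size() == data.size());
  SKETCH_IVDEP
  for (std::size_t i = 0; i < perm.size(); ++i) {
    data[perm[i]] = data[perm[i]] * 2 + in[i];
  }
}
```

The attribute is a promise from the programmer, not something the compiler checks. If `perm` contained a repeated index, the loop would carry a real dependency and the attribute would not be safe to use.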
