Load-Store Unit Modifiers

Developer Guide

FPGA Optimization Guide for Intel® oneAPI Toolkits

ID 767853

Date 12/16/2022

Version

Public

Load-Store Unit Modifiers

Depending on the memory access pattern in your kernel, the compiler might apply one or more of the following modifiers to an LSU.

Cached

A burst-coalesced LSU might include a cache. The compiler creates a cache when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads, even if those loads request the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The compiler infers a cache only for non-volatile global pointers.
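
As a rough illustration of the access pattern described above, the following plain C++ sketch (host code, not a SYCL kernel) performs a data-dependent, repetitive lookup. The function names are hypothetical, and the `volatile` variant is included only to mirror the statement about non-volatile pointers. Whether the compiler actually infers a cached LSU for equivalent kernel code depends on the compiler and the surrounding design.

```cpp
#include <cassert>
#include <cstddef>

// Data-dependent, repetitive loads: the address read from `table`
// depends on the values in `indices`, and the same entries may be
// read many times.
int sum_lookups(const int* table, std::size_t table_size,
                const std::size_t* indices, std::size_t count) {
  int sum = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const std::size_t idx = indices[i];
    assert(idx < table_size);  // guard against out-of-range indices
    sum += table[idx];         // load address depends on loaded data
  }
  return sum;
}

// Same computation through a volatile-qualified pointer. Per the text,
// a cache is inferred only for non-volatile global pointers, so the
// equivalent kernel access would not be a candidate for a cache.
int sum_lookups_volatile(const volatile int* table, std::size_t table_size,
                         const std::size_t* indices, std::size_t count) {
  int sum = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const std::size_t idx = indices[i];
    assert(idx < table_size);
    sum += table[idx];  // every read goes to memory
  }
  return sum;
}
```

Both functions return the same value. The only difference is the pointer qualification, which is the property the text ties to cache inference.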

Write-Acknowledge (write-ack)

A burst-coalesced store LSU sometimes requires a write-acknowledge signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.
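
To make the notion of a data dependency on a store concrete, here is a minimal plain C++ sketch (host code, not a SYCL kernel) of a read-modify-write loop whose addresses depend on input data. The function name `count_keys` is hypothetical. Whether the compiler emits a write-acknowledge LSU for equivalent kernel code is an assumption and depends on the compiler's dependency analysis.

```cpp
#include <cassert>
#include <cstddef>

// Read-modify-write on memory whose address depends on input data.
// If keys[i] == keys[i + 1], the load in iteration i + 1 must observe
// the store from iteration i, so the store has to be known to have
// completed before the dependent load proceeds.
void count_keys(const std::size_t* keys, std::size_t count,
                int* bins, std::size_t bin_count) {
  for (std::size_t i = 0; i < count; ++i) {
    const std::size_t k = keys[i];
    assert(k < bin_count);  // guard against out-of-range keys
    bins[k] = bins[k] + 1;  // store that a later load may depend on
  }
}
```

For example, with keys `{2, 2, 0, 2, 1}` and three zero-initialized bins, consecutive iterations touch bin 2 back to back, which is the case where store completion matters.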

Nonaligned

When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.
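
The arithmetic behind alignment can be sketched in a few lines of C++. The 64-byte word size below is a hypothetical value chosen for illustration (the real external memory word size depends on the target board), and the helper names are not part of any Intel API. The sketch shows one plausible reason unaligned requests cost more: an access that straddles a word boundary touches two memory words instead of one.

```cpp
#include <cstddef>

// Hypothetical external memory word size in bytes (assumption for
// illustration only; the real value depends on the target board).
constexpr std::size_t kWordBytes = 64;

// True if the byte address falls on a memory-word boundary.
constexpr bool is_word_aligned(std::size_t byte_address) {
  return byte_address % kWordBytes == 0;
}

// Number of memory words touched by an access of `size_bytes` bytes
// starting at `byte_address`. Requires size_bytes > 0.
constexpr std::size_t words_touched(std::size_t byte_address,
                                    std::size_t size_bytes) {
  return (byte_address + size_bytes - 1) / kWordBytes
         - byte_address / kWordBytes + 1;
}

// An aligned 64-byte access fits in one word; the same access shifted
// by 32 bytes spans two words.
static_assert(words_touched(0, 64) == 1, "aligned access touches one word");
static_assert(words_touched(32, 64) == 2, "unaligned access touches two words");
```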

Never-stall

If a pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because every access to the memory takes a fixed number of cycles that is known to the compiler.
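
The consequence of a fixed, known latency can be sketched with a simple timing model. The latency of 3 cycles and the helper names are hypothetical (the compiler determines the real latency); the point is only that, with no stalls, the completion cycle of every access can be computed ahead of time.

```cpp
// Hypothetical fixed access latency in cycles (assumption for
// illustration only; the compiler determines the real value).
constexpr unsigned kLatencyCycles = 3;

// With a fixed latency and no stalls, data requested at `issue_cycle`
// is available at issue_cycle + kLatencyCycles.
constexpr unsigned ready_cycle(unsigned issue_cycle) {
  return issue_cycle + kLatencyCycles;
}

// Cycle at which the last of `n` back-to-back requests completes,
// assuming one request is issued per cycle starting at cycle 0.
// Requires n > 0.
constexpr unsigned last_ready_cycle(unsigned n) {
  return ready_cycle(n - 1);
}

static_assert(ready_cycle(0) == 3, "first access completes at cycle 3");
static_assert(last_ready_cycle(8) == 10, "eighth access completes at cycle 10");
```

Because these completion times do not depend on run-time conditions, a schedule can be fixed statically, which is consistent with the reason the text gives for the never-stall modifier.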
