5.2.2. Unroll Loops

Intel® High Level Synthesis Compiler Pro Edition: Best Practices Guide

ID 683152

Date 12/13/2021

Public

5.2.2. Unroll Loops

When a loop is unrolled, each iteration of the loop is replicated in hardware and executes simultaneously if the iterations are independent. Unrolling loops trades an increase in FPGA area use for a reduction in the latency of your component.

Consider the following basic loop with three stages and three iterations. Each stage represents the operations that occur in the loop within one clock cycle.

Figure 31. Basic loop with three stages and three iterations

If each stage of this loop takes one clock cycle to execute and the iterations run one after another, then this loop has a latency of nine cycles (three iterations × three stages).

The following figure shows the loop from Figure 31 unrolled three times.

Figure 32. Unrolled loop with three stages and three iterations

Three iterations of the loop can now be completed in only three clock cycles, but three times as many hardware resources are required.
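The cycle counts for the two figures can be reproduced with a small back-of-the-envelope model. This is only a sketch: the function names are hypothetical, it assumes one clock cycle per stage and fully independent iterations, and it ignores pipelining and any rescheduling the compiler performs. The generalization to an unroll factor that does not cover every iteration is an assumption of this model, not a statement about what the compiler produces.

```cpp
// Simplified latency model (hypothetical helpers, not part of any HLS API).
// Assumes one clock cycle per stage and no pipelining.

// Cycles needed when the iterations run one after another.
constexpr int sequential_latency(int stages, int iterations) {
    return stages * iterations;
}

// Cycles needed when `factor` independent iterations run side by side.
// The iterations are processed in ceil(iterations / factor) groups.
constexpr int unrolled_latency(int stages, int iterations, int factor) {
    return stages * ((iterations + factor - 1) / factor);
}

// Basic loop: three stages, three iterations, run back to back.
static_assert(sequential_latency(3, 3) == 9, "basic loop takes nine cycles");

// Same loop unrolled three times: all iterations run at once.
static_assert(unrolled_latency(3, 3, 3) == 3, "unrolled loop takes three cycles");
```

In this model the hardware cost grows with the unroll factor while the latency shrinks, which is the area-for-latency trade described above.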

You can control how the compiler unrolls a loop with the #pragma unroll directive, but this directive works only if the compiler knows the trip count for the loop in advance or if you specify the unroll factor. In addition to replicating the hardware, the compiler also reschedules the circuit such that each operation runs as soon as the inputs for the operation are ready.
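As a minimal sketch of the two cases described above, the following plain C++ functions show a loop with a compile-time trip count and a loop with an explicit unroll factor. The function names, the array size, and the factor of 2 are illustrative assumptions and do not come from the tutorial. The functions are written as ordinary C++ so that they build with a standard compiler, which may treat the pragma as a hint or ignore it; in an HLS design the loops would sit inside your component function.

```cpp
#include <cassert>

// Illustrative array size (an assumption for this sketch).
constexpr int kSize = 4;

// Trip count known at compile time: a bare "#pragma unroll" asks the
// compiler to replicate the loop body for every iteration. The iterations
// are independent because each one writes a different element of c.
void vec_add_full(const int (&a)[kSize], const int (&b)[kSize], int (&c)[kSize]) {
    #pragma unroll
    for (int i = 0; i < kSize; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Trip count known only at run time: an explicit unroll factor (2 here)
// tells the compiler how many copies of the loop body to create.
void vec_add_by_two(const int* a, const int* b, int* c, int n) {
    assert(n >= 0);  // host-side sanity check, not part of the technique
    #pragma unroll 2
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Both functions compute the same result as their non-unrolled equivalents; the pragma changes only how much hardware is generated and how the operations are scheduled.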

For an example of using the #pragma unroll directive, see the best_practices/resource_sharing_filter tutorial.
