2.3.1. Conventional Versus Hyper-Pipelining

Hyperflex® Architecture High-Performance Design Handbook

ID 683353

Date 7/07/2025

Public

Hyper-Pipelining simplifies the process of conventional pipelining. Conventional pipelining includes the following design modifications:

Add two registers between logic clouds.
Modify the HDL to insert a third register (or pipeline stage) into the design’s logic cloud, which is Logic Cloud 2. This register insertion effectively creates Logic Cloud 2a and Logic Cloud 2b in the HDL.

Figure 32. Conventional Pipelining User Modifications

Figure 33. Hyper-Pipelining User Modifications

Hyper-Pipelining simplifies the process of adding registers. Add the registers (Pipe 1, Pipe 2, and Pipe 3) in aggregate at one location in the design RTL. The Compiler retimes the registers throughout the circuit to find the optimal placement along the path. This optimization reduces path delay and maximizes the design's operating frequency.

The following figure shows the implementation of the additional registers after the retiming stage completes optimization.

Figure 34. Hyper-Pipelining and Hyper-Retiming Implementation

The resulting implementation in the Hyper-Pipelining flow differs from the conventional pipelining flow by the location of the Pipe 3 register. Because the Compiler is aware of the current circuit implementation, including routing, the Compiler can more effectively locate the aggregate registers to meet the design’s maximum operating frequency. Hyper-Pipelining requires significantly less effort than conventional pipelining techniques because you can place registers at a convenient location in a data path. The Compiler optimizes the register placements automatically.
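The effect of register location on path delay can be illustrated with a toy model. The following Python sketch is not the Compiler's retiming algorithm; it treats a data path as a chain of elements with made-up delays (`DELAYS`, in arbitrary units) and exhaustively searches for the placement of three registers that minimizes the longest register-to-register segment. The function names and the hand-chosen placement `(1, 3, 5)` are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Hypothetical delays of successive logic/routing elements on one path (arbitrary units).
DELAYS = [4, 1, 3, 2, 5, 1, 2]


def worst_segment_delay(delays, cuts):
    """Return the longest register-to-register delay.

    `cuts` lists element indices; a register sits directly after each listed element.
    An empty `cuts` models registers grouped at the end of the path, which leaves
    the whole path as a single combinational segment.
    """
    bounds = [0, *[c + 1 for c in cuts], len(delays)]
    return max(sum(delays[a:b]) for a, b in zip(bounds, bounds[1:]))


def best_register_placement(delays, num_registers):
    """Try every placement of `num_registers` between adjacent elements.

    Returns (worst segment delay, cut positions) for the best placement found.
    """
    slots = range(len(delays) - 1)  # slot i lies between element i and element i + 1
    return min(
        (worst_segment_delay(delays, cuts), cuts)
        for cuts in combinations(slots, num_registers)
    )


# Registers added in aggregate at one location, before any retiming.
print(worst_segment_delay(DELAYS, ()))         # 18

# A hand-chosen, evenly spaced placement fixed in the HDL.
print(worst_segment_delay(DELAYS, (1, 3, 5)))  # 6

# Placement found by searching with knowledge of the actual delays.
print(best_register_placement(DELAYS, 3))      # (5, (1, 3, 4))
```

In this model the searched placement differs from the hand-chosen one only in the position of the third register, and the worst segment drops from 6 to 5 units. The real Compiler works from the placed and routed circuit rather than a simple delay list, so the sketch only conveys why delay-aware placement can outperform positions fixed in the RTL.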
