2.4.1.3. FPGA AI Suite IP Datapath Component Organization
The FPGA AI Suite IP provides components in the datapath that can be customized via an architecture (.arch) file.
The following diagram shows the connections and I/O of these components:
- Processing element array
Many of the concepts described below relate to the processing element (PE). Refer to the detailed discussion in Parallelism in the FPGA AI Suite IP before continuing.
- Input feeder
For details, refer to Input Feeder.
- PE array filter interleave
For details, refer to PE Array Filter Interleave.
- Scratchpad sizing
Scratchpad memory holds filters and supplies them to the PE array in parallel.
- Crossbar
The crossbar acts as a cache for the output features exiting the PE array. The size of the buffer array in the crossbar is proportional to the output size of the PE array.
- Architecture Precision
For details, refer to Architecture Precision.
To learn more about architecture file parameters, including an example architecture file, refer to Creating an Architecture File for the FPGA AI Suite IP. To explore advanced optimization techniques enabled by the architecture file, refer to Optimizing Your FPGA AI Suite IP.
Input Feeder
Input features arriving over DMA are stored temporarily in a static cache called the stream buffer. The stream buffer depth controls how much input feature data can be stored on-chip. The stream buffer width is proportional to the PE array input size (c_vector and k_vector), because the buffer must supply enough input features to the PEs in parallel.
The stream buffer size can be changed by the architecture optimizer. Changing the stream buffer size affects performance and area as follows:
- M20K usage: the number of M20K blocks consumed grows with the buffer size, since these blocks implement the buffer.
- FPS: throughput drops as the buffer shrinks, although smaller graphs are less affected.
- DDR bandwidth: reducing on-chip buffering means spilling to DDR more often, which increases DDR traffic.
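The trade-off above can be sketched with a back-of-envelope sizing model. The following Python sketch is illustrative only, not the IP's actual sizing formula: it assumes the stream buffer is built from 20-kilobit M20K blocks and that the buffer width scales with c_vector × k_vector at fp16 (16-bit) features, as described above. The vector sizes and depth in the example are hypothetical.

```python
M20K_BITS = 20 * 1024   # capacity of one M20K block, in bits
FEATURE_BITS = 16       # fp16 features

def stream_buffer_m20ks(c_vector: int, k_vector: int, depth_words: int) -> int:
    """Estimate M20K blocks needed for a stream buffer (illustrative model)."""
    width_bits = c_vector * k_vector * FEATURE_BITS   # bits per buffer word
    total_bits = width_bits * depth_words             # width x depth
    return -(-total_bits // M20K_BITS)                # ceiling division

# Hypothetical configuration: doubling the depth roughly doubles M20K usage,
# which is the area side of the FPS / DDR-spill trade-off described above.
small = stream_buffer_m20ks(c_vector=16, k_vector=8, depth_words=1024)   # 103
large = stream_buffer_m20ks(c_vector=16, k_vector=8, depth_words=2048)   # 205
```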
PE Array Filter Interleave
DSPs inside the PE array have a fixed latency, L, for a multiply-and-add operation. If the PE naively computed the dot product of consecutive features and filters, this latency would lower occupancy because of the loop-carried dependency on the accumulator.
In the naïve case illustrated above, without any interleaving, consecutive features and filters are multiplied and accumulated together. To compute the next partial sum, the previous partial sum must be available immediately at the FP32 add block, which is impossible given the latency of the hardware. Consequently, this creates a loop-carried dependency, shown in the red feedback path of 4 clock cycles.
By interleaving filters (note that different filters are multiplied with the same feature) and inserting pipeline registers of depth equal to the latency, the PE computes partial sums for L different output features. In this way, the next partial sum does not depend on the result of the previous one, so the penalty on the feedback path (shown in green) is removed. Because the PE is fully pipelined in this way, its occupancy is maximized, which increases throughput.
This PE architecture also enables inserting the bias into the dot-product computation: by multiplexing the input to the pipeline registers, the adder can select between a bias value and a partial sum.
For more information on setting the interleaving parameters, refer to PE Array Parameters: num_interleaved_features, num_interleaved_filters.
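The scheduling idea above can be illustrated with a small Python simulation (software only, not the IP's RTL). The latency value, feature stream, and filter weights below are hypothetical; the point is that with L accumulators rotating through the pipeline, a new multiply-accumulate can issue every cycle instead of stalling on the previous partial sum.

```python
LATENCY = 4   # assumed multiply-add pipeline latency, in clock cycles

features = [1.0, 2.0, 3.0, 4.0, 5.0]                        # one feature stream
filters = [[0.5] * 5, [1.0] * 5, [-1.0] * 5, [2.0] * 5]     # LATENCY filters

# Naive schedule: a single accumulator must wait LATENCY cycles for each
# partial sum before issuing the next add, so 5 MACs cost 5 * LATENCY cycles.
naive_cycles = len(features) * LATENCY   # only 1/LATENCY occupancy

# Interleaved schedule: LATENCY accumulators (one per interleaved filter)
# rotate through the pipeline. The same feature is multiplied with different
# filters, so each accumulator's result is ready exactly when its turn
# comes around again, and one MAC issues every cycle (full occupancy).
accumulators = [0.0] * LATENCY
issued = 0
for i, feature in enumerate(features):
    for k in range(LATENCY):             # same feature, different filters
        accumulators[k] += feature * filters[k][i]
        issued += 1                      # one MAC issued per cycle

# The interleaved schedule produces the LATENCY independent dot products
# in the same cycle count the naive schedule needs for just one of them.
```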
Architecture Precision
This parameter sets the precision of the filter weights. It affects DDR traffic, on-chip buffer area usage, and the size of the IP.
- The accumulator operates at fp32 precision.
- The accumulator bias value (if present) is fp16 precision.
- The auxiliary blocks (activations, pooling, and depthwise convolution) operate at fp16 precision.
- The feature inputs to the FPGA AI Suite IP must be provided at fp16 precision.
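The precision boundaries listed above can be illustrated in Python using the standard library's IEEE-754 half-precision conversion (struct format 'e'). The feature, weight, and bias values below are hypothetical, and Python's double precision stands in for the fp32 accumulator; this is a sketch of where rounding occurs, not the IP's arithmetic.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE-754 half-precision (fp16) value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Hypothetical inputs: features and filter weights enter the IP at fp16,
# so values like 0.1 are already rounded before any arithmetic happens.
features = [to_fp16(v) for v in (0.1, 0.2, 0.3)]
weights = [to_fp16(v) for v in (1.5, -2.5, 0.75)]
bias = to_fp16(0.125)   # the accumulator bias is held at fp16

# Accumulation runs at higher precision (fp32 in the IP; Python's fp64
# stands in for it here), so rounding is confined to the fp16 inputs.
acc = bias
for f, w in zip(features, weights):
    acc += f * w
```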
- External AXI Bus Interface Parameters
Defines the AXI bus width. Changing this parameter affects the DDR bandwidth, and fine-tuning it is the responsibility of the FPGA team.
- Types/vectorization of auxiliary layer blocks
Enables an auxiliary module, typically an activation module, to support execution of that activation function on the FPGA. If an activation module is turned off, the host CPU must execute that activation layer, and descendant or parent layers may also be assigned to the CPU.
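The effect of the external AXI bus-width parameter described above can be sketched with a simple peak-bandwidth estimate. This is illustrative only: the widths and clock rate below are assumptions, and real achievable DDR bandwidth is lower than this theoretical peak.

```python
def peak_bandwidth_gbps(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak throughput of one AXI data channel in GB/s: one transfer of
    bus_width_bits per clock cycle (illustrative upper bound)."""
    return bus_width_bits / 8 * clock_mhz * 1e6 / 1e9

# Hypothetical configurations: at a fixed clock, doubling the AXI data-bus
# width doubles the peak DDR-side bandwidth.
narrow = peak_bandwidth_gbps(bus_width_bits=256, clock_mhz=250)   # 8.0 GB/s
wide = peak_bandwidth_gbps(bus_width_bits=512, clock_mhz=250)     # 16.0 GB/s
```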