Intel® FPGA AI Suite: PCIe-based Design Example User Guide

ID 768977
Date 12/01/2023
Public



7.2. Hardware

This section describes the Example Design (Intel® Arria® 10) in detail. However, many of the components close to the IP are shared with the Example Design (Intel Agilex® 7).

A top-level view of the design example is shown in Figure 3 (Intel FPGA AI Suite Example Design Top Level).

There are two instances of the Intel FPGA AI Suite IP, shown on the right (dla_top.sv). All communication between the Intel FPGA AI Suite IP systems and the outside world occurs via the Intel FPGA AI Suite DMA. The Intel FPGA AI Suite DMA provides a CSR (which also has interrupt functionality) and reader/writer modules that read from and write to DDR memory.

The host communicates with the board through PCIe* using the CCI-P protocol. The host can do the following:

  1. Read and write the on-board DDR memory (these reads/writes do not go through the Intel FPGA AI Suite IP).
  2. Read/write to the Intel FPGA AI Suite DMA CSR of both instances (see the sketch after this list).
  3. Receive interrupt signals from the Intel FPGA AI Suite DMA CSR of both instances.
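On the software side, the CSR accesses in item 2 are ordinary MMIO reads and writes. The following is a minimal sketch, assuming the standard OPAE C API and an already opened accelerator handle; the MMIO region number and the per-instance CSR base offsets are hypothetical placeholders, not the actual register map of this design example.

```c
/* Minimal sketch of host MMIO access to the per-instance DMA CSRs.
 * Assumes the standard OPAE C API. The MMIO region number and CSR base
 * offsets are hypothetical placeholders, not this design's register map. */
#include <stdint.h>
#include <opae/fpga.h>

#define MMIO_NUM            0         /* first MMIO region of the AFU      */
#define CSR_BASE_INSTANCE0  0x20000   /* hypothetical CSR base, instance 0 */
#define CSR_BASE_INSTANCE1  0x30000   /* hypothetical CSR base, instance 1 */

/* Read one 32-bit CSR of a given IP instance (0 or 1). */
static fpga_result read_dla_csr(fpga_handle afc, int instance,
                                uint64_t offset, uint32_t *value)
{
    uint64_t base = instance ? CSR_BASE_INSTANCE1 : CSR_BASE_INSTANCE0;
    return fpgaReadMMIO32(afc, MMIO_NUM, base + offset, value);
}

/* Write one 32-bit CSR of a given IP instance (0 or 1). */
static fpga_result write_dla_csr(fpga_handle afc, int instance,
                                 uint64_t offset, uint32_t value)
{
    uint64_t base = instance ? CSR_BASE_INSTANCE1 : CSR_BASE_INSTANCE0;
    return fpgaWriteMMIO32(afc, MMIO_NUM, base + offset, value);
}
```

Because both instances sit behind the same MMIO region, selecting an instance is purely a matter of adding the correct CSR base offset.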

Each Intel FPGA AI Suite IP instance can do the following:

  1. Read/write to its DDR bank.
  2. Send interrupts to the host through the interrupt interface (see the sketch after this list).
  3. Receive reads/writes to its DMA CSR.
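On the host, these interrupts surface as OPAE events that can be waited on like a file descriptor. The sketch below assumes the standard OPAE C API; treating the OPAE flags argument as the per-instance interrupt vector number is an assumption made for illustration, not a statement about this design's interrupt wiring.

```c
/* Sketch: block until one interrupt arrives from a given interrupt vector.
 * Assumes the standard OPAE C API; mapping vectors to IP instances is an
 * assumption for illustration only. */
#include <poll.h>
#include <opae/fpga.h>

static fpga_result wait_for_dla_interrupt(fpga_handle afc, uint32_t vector,
                                          int timeout_ms)
{
    fpga_event_handle ev;
    fpga_result res = fpgaCreateEventHandle(&ev);
    if (res != FPGA_OK)
        return res;

    /* Bind the event handle to the requested interrupt vector. */
    res = fpgaRegisterEvent(afc, FPGA_EVENT_INTERRUPT, ev, vector);
    if (res == FPGA_OK) {
        int fd = -1;
        fpgaGetOSObjectFromEventHandle(ev, &fd);

        /* The OS object is a file descriptor that can be polled. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, timeout_ms) <= 0)
            res = FPGA_EXCEPTION;   /* timeout or poll error */

        fpgaUnregisterEvent(afc, FPGA_EVENT_INTERRUPT, ev);
    }

    fpgaDestroyEventHandle(&ev);
    return res;
}
```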

From the perspective of the Intel FPGA AI Suite accelerator function (AF), external connections are to the CCI-P interface running over PCIe*, and to the on-board DDR4 memory. The DDR memory is connected directly to the board.qsys block, while the CCI-P interface is converted into Avalon® memory-mapped (MM) interfaces in the bsp_logic.sv block for communication with the board.qsys block.

The board.qsys block arbitrates the connections to DDR memory between the reader/writer modules in the Intel FPGA AI Suite IP and the reads/writes from the host. Each Intel FPGA AI Suite IP instance in this design has access to only one of the two DDR banks. This design decision implies that no more than two simultaneous Intel FPGA AI Suite IP instances can exist in the design. Adding an additional arbiter would relax this restriction and allow additional Intel FPGA AI Suite IP instances.
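From the software point of view, this one-bank-per-instance partitioning means each IP instance can be treated as a self-contained pair of CSR block and DDR bank. The sketch below is only an illustration of that pairing; all base addresses and sizes are hypothetical placeholders, not the address map of this design example.

```c
/* Illustration of the one-DDR-bank-per-IP-instance partitioning.
 * All addresses and sizes are hypothetical placeholders. */
#include <stdint.h>

typedef struct {
    uint64_t csr_base;   /* MMIO offset of this instance's DMA CSR block    */
    uint64_t ddr_base;   /* base of the DDR bank dedicated to this instance */
    uint64_t ddr_size;   /* size of that bank                               */
} dla_instance_t;

/* Two instances, two banks: instance i only ever touches its own bank. */
static const dla_instance_t dla_instances[2] = {
    { .csr_base = 0x20000, .ddr_base = 0x000000000ULL, .ddr_size = 0x100000000ULL },
    { .csr_base = 0x30000, .ddr_base = 0x100000000ULL, .ddr_size = 0x100000000ULL },
};
```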

Much of board.qsys operates using the Avalon® memory-mapped (MM) interface protocol. The Intel FPGA AI Suite DMA uses the AXI protocol, and board.qsys has Avalon® MM-to-AXI adapters just before each interface is exported from the Platform Designer system (so that outside of the Platform Designer system it can be connected to the Intel FPGA AI Suite IP). Clock crossings are also handled inside of board.qsys. For example, the host interface must be brought to the DDR clock to talk with the Intel FPGA AI Suite IP CSR.

There are three clock domains: the host clock, the DDR clock, and the Intel FPGA AI Suite IP clock. The PCIe* logic runs on the host clock at 200 MHz. The Intel FPGA AI Suite DMA and the platform adapters run on the DDR clock. The rest of the Intel FPGA AI Suite IP runs on the Intel FPGA AI Suite IP clock.

The Intel FPGA AI Suite IP uses the following interface protocols:

  • Readers and writers: AXI4 interface with 512-bit data (width configurable), 32-bit address (width fixed), and 16-word maximum burst.
  • CSR: 32-bit data, 11-bit address (see the sketch after this list).
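The 11-bit CSR address range can, for example, be reflected in a host-side bounds check before issuing an MMIO access, as in the hedged sketch below (the helper name is illustrative, and treating the 11-bit address as a byte offset is an assumption).

```c
/* Sketch: reject CSR offsets outside the 11-bit address window before
 * issuing a 32-bit MMIO write. Assumes the standard OPAE C API and that
 * the 11-bit CSR address is a byte offset (an assumption). */
#include <stdint.h>
#include <opae/fpga.h>

#define DLA_CSR_ADDR_BITS  11
#define DLA_CSR_SPAN       (1u << DLA_CSR_ADDR_BITS)   /* 2 KB window */

static fpga_result checked_csr_write(fpga_handle afc, uint64_t csr_base,
                                     uint64_t offset, uint32_t value)
{
    if (offset >= DLA_CSR_SPAN || (offset & 0x3))
        return FPGA_INVALID_PARAM;   /* out of range or not 32-bit aligned */
    return fpgaWriteMMIO32(afc, 0 /* mmio_num */, csr_base + offset, value);
}
```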
Figure 3.  Intel FPGA AI Suite Example Design Top Level
Note: Arrows show host/agent relationships. Clock domains indicated with dashed lines.

The board.qsys block interfaces between DDR memory, the readers/writers, and the host read/write channels. The internals of the board.qsys block are shown in Figure 4. This figure shows three Avalon MM interfaces on the left and bottom: MMIO, host read, and host write.

  • Host read is used to read data from DDR memory and send it to the host.
  • Host write is used to transfer data from the host into DDR memory.
  • The MMIO interface performs several functions:
    • Initiating DDR read and write transactions from the host.
    • Reading from the AFU ID block. The AFU ID block identifies the AFU with a unique identifier and is required by the OPAE driver (see the sketch after this list).
    • Reading/writing to the DLA DMA CSRs, where each instance has its own CSR base address.
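The AFU ID is what lets host software find this accelerator in the first place. A minimal sketch of that lookup, assuming the standard OPAE C API, is shown below; the GUID string is a placeholder and must be replaced with the AFU ID reported by this design example.

```c
/* Sketch: locate and open the AFU by its AFU ID (GUID) through OPAE.
 * The GUID string is a placeholder, not this design example's real AFU ID. */
#include <uuid/uuid.h>
#include <opae/fpga.h>

#define AFU_ID_PLACEHOLDER "00000000-0000-0000-0000-000000000000"

static fpga_result open_dla_afu(fpga_handle *afc)
{
    fpga_guid guid;
    fpga_properties filter = NULL;
    fpga_token token;
    uint32_t num_matches = 0;
    fpga_result res;

    if (uuid_parse(AFU_ID_PLACEHOLDER, guid) != 0)
        return FPGA_INVALID_PARAM;

    /* Restrict enumeration to accelerators exposing this AFU ID. */
    res = fpgaGetProperties(NULL, &filter);
    if (res != FPGA_OK)
        return res;
    fpgaPropertiesSetObjectType(filter, FPGA_ACCELERATOR);
    fpgaPropertiesSetGUID(filter, guid);

    res = fpgaEnumerate(&filter, 1, &token, 1, &num_matches);
    if (res == FPGA_OK && num_matches > 0)
        res = fpgaOpen(token, afc, 0);
    else if (res == FPGA_OK)
        res = FPGA_NOT_FOUND;

    if (num_matches > 0)
        fpgaDestroyToken(&token);
    fpgaDestroyProperties(&filter);
    return res;
}
```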
Figure 4. The board.qsys Block, Showing Two DDR Connections and Two IP Instances
Note: Arrows indicate host/agent relationships (from host to agent).

The above figure also shows the ddr_board.qsys block. The three central blocks (the address expander and two msgdma_bbb.qsys scatter-gather DMA blocks) allow host direct memory access (DMA) to DDR. This DMA is distinct from the DMA module inside of the Intel FPGA AI Suite IP, shown in Figure 3. Host reads and writes begin with the host sending a request via the MMIO interface to initiate the transfer. For a read, the DMA gathers the data from DDR and sends it to the host via the host-read interface. For a write, the DMA reads the data over the host-write interface and subsequently writes it to DDR.
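In software terms, this flow corresponds roughly to the host sharing a buffer with the FPGA, handing its I/O address to the host DMA over the MMIO interface, and waiting for completion. The buffer-handling calls in the sketch below assume the standard OPAE C API; the HOST_DMA_* register offsets and the completion mechanism are hypothetical stand-ins for the real msgdma_bbb programming model.

```c
/* Sketch of the host-to-DDR write flow: share a buffer with the FPGA, then
 * program the host DMA over MMIO. Buffer calls follow the standard OPAE C
 * API; the HOST_DMA_* offsets are hypothetical placeholders and do not
 * describe the real msgdma_bbb descriptor format. */
#include <stdint.h>
#include <string.h>
#include <opae/fpga.h>

#define HOST_DMA_SRC_ADDR  0x1000   /* hypothetical: host I/O address */
#define HOST_DMA_DST_ADDR  0x1008   /* hypothetical: DDR byte address */
#define HOST_DMA_LENGTH    0x1010   /* hypothetical: transfer length  */
#define HOST_DMA_START     0x1018   /* hypothetical: go bit           */

static fpga_result host_write_to_ddr(fpga_handle afc, const void *src,
                                     uint64_t len, uint64_t ddr_addr)
{
    void *buf = NULL;
    uint64_t wsid = 0, io_addr = 0;

    /* Allocate a pinned buffer that the FPGA can access. */
    fpga_result res = fpgaPrepareBuffer(afc, len, &buf, &wsid, 0);
    if (res != FPGA_OK)
        return res;
    memcpy(buf, src, len);

    /* Fetch the I/O (physical) address the DMA engine reads from. */
    res = fpgaGetIOAddress(afc, wsid, &io_addr);
    if (res == FPGA_OK) {
        fpgaWriteMMIO64(afc, 0, HOST_DMA_SRC_ADDR, io_addr);
        fpgaWriteMMIO64(afc, 0, HOST_DMA_DST_ADDR, ddr_addr);
        fpgaWriteMMIO64(afc, 0, HOST_DMA_LENGTH,   len);
        fpgaWriteMMIO64(afc, 0, HOST_DMA_START,    1);
        /* Completion would be detected via an interrupt or a status CSR
         * (not shown; the mechanism depends on the actual design). In a
         * real flow the buffer must stay allocated until then. */
    }

    fpgaReleaseBuffer(afc, wsid);
    return res;
}
```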

Note that in board.qsys, a block for the Avalon MM to AXI4 conversion is not explicitly instantiated. Instead, an Avalon MM pipeline bridge connects to an AXI4 bridge. Platform Designer implicitly infers a protocol adapter between these two bridges.

Note: Avalon MM/AXI4 adapters in Platform Designer might not close timing.

Platform Designer optimizes for area instead of fMAX by default, so you might need to change the interconnect settings for the inferred Avalon MM/AXI4 adapter. For example, we made some changes as shown in the following figure.

Figure 5. Adjusting the Interconnect Settings for the Inferred Avalon MM/AXI4 Adapter to Optimize for fMAX Instead of Area.
Note: This enables timing closure on the DDR clock.

To access the view in the above figure:

  • Within the Platform Designer GUI, choose View -> Domains. This brings up the Domains tab in the top-right window.
  • From there, choose an interface (for example, ddr_0_axi).
  • For the selected interface, you can adjust the interconnect parameters, as shown on the bottom-right pane.
  • In particular, we needed to change Burst adapter implementation from Generic converter (slower, lower area) to Per-burst-type converter (faster, higher area) to close timing on the DDR clock.

This was the only change needed to close timing; however, it took several rounds of experimentation to determine that this was the important setting. Depending on your system, you might need to adjust other settings.