Streaming DMA Accelerator Functional Unit (AFU) User Guide

Updated for Intel® Acceleration Stack for Intel® Xeon® CPU with FPGAs: 1.1 Production

1. About this Document
  1.1. Intended Audience
  1.2. Conventions
  1.3. Acronyms
  1.4. Acceleration Glossary

2. Streaming DMA AFU Description
  2.1. Hardware Subsystems
  2.2. Streaming DMA Test System
  2.3. Memory-to-Stream DMA BBB
  2.4. Stream-to-Memory DMA BBB

3. Memory Map and Address Spaces
  3.1. Streaming DMA AFU Memory Map
  3.2. Memory-to-Stream DMA BBB Memory Map
  3.3. Stream-to-Memory DMA BBB Memory Map
  3.4. Device Feature Header Linked-list

4. Software Programming Model

5. Running the AFU Example

6. Generating the Accelerator Function (AF)

7. Simulating the AFU Example

8. Document Revision History for Streaming DMA Accelerator Functional Unit (AFU) User Guide

1. About this Document

This document describes the streaming direct memory access (DMA) Accelerator Functional Unit (AFU), which is implemented using Platform Designer.

1.1. Intended Audience

This document is intended for hardware or software developers who require an Accelerator Function (AF) that accesses data buffered in memory and provides it to an accelerator as a serial stream of data. Intel recommends that you become familiar with Platform Designer before using this design example.

Related Information
Platform Designer User Guide

1.2. Conventions

Table 1. Document Conventions

<table>
<thead>
<tr>
<th>Convention</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>#</td>
<td>If this symbol precedes a command, enter the command as root.</td>
</tr>
<tr>
<td>$</td>
<td>If this symbol precedes a command, enter the command as a user.</td>
</tr>
<tr>
<td>This font</td>
<td>Indicates file names, commands, and keywords. The font also indicates long command lines. For long command lines, press Enter only if the next line starts a new command, where the # or $ character denotes the start of the next command.</td>
</tr>
<tr>
<td>&lt;variable_name&gt;</td>
<td>Indicates placeholder text that you must replace with appropriate values. Do not include the angle brackets.</td>
</tr>
</tbody>
</table>

1.3. Acronyms

Table 2. Acronyms

<table>
<thead>
<tr>
<th>Acronyms</th>
<th>Expansion</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>AF</td>
<td>Accelerator Function</td>
<td>Compiled Hardware Accelerator image implemented in FPGA logic that accelerates an application.</td>
</tr>
<tr>
<td>AFU</td>
<td>Accelerator Functional Unit</td>
<td>Hardware Accelerator implemented in FPGA logic which offloads a computational operation for an application from the CPU to improve performance.</td>
</tr>
<tr>
<td>API</td>
<td>Application Programming Interface</td>
<td>A set of subroutine definitions, protocols, and tools for building software applications.</td>
</tr>
<tr>
<td>ASE</td>
<td>AFU Simulation Environment</td>
<td>Co-simulation environment that allows you to use the same host application and AF in a simulation environment. ASE is part of the Intel Acceleration Stack for FPGAs.</td>
</tr>
<tr>
<td>CCI-P</td>
<td>Core Cache Interface</td>
<td>CCI-P is the standard interface AFUs use to communicate with the host.</td>
</tr>
<tr>
<td>CL</td>
<td>Cache Line</td>
<td>64-byte cache line</td>
</tr>
<tr>
<td>DFH</td>
<td>Device Feature Header</td>
<td>Creates a linked list of feature headers to provide an extensible way of adding features.</td>
</tr>
<tr>
<td>FIM</td>
<td>FPGA Interface Manager</td>
<td>The FPGA hardware containing the FPGA Interface Unit (FIU) and external interfaces for memory, networking, etc. The Accelerator Function (AF) interfaces with the FIM at run time.</td>
</tr>
<tr>
<td>FIU</td>
<td>FPGA Interface Unit</td>
<td>FIU is a platform interface layer that acts as a bridge between platform interfaces like PCIe*, UPI and AFU-side interfaces such as CCI-P.</td>
</tr>
<tr>
<td>MPF</td>
<td>Memory Properties Factory</td>
<td>The MPF is a Basic Building Block (BBB) that AFUs can use to provide CCI-P traffic shaping operations for transactions with the FIU.</td>
</tr>
</tbody>
</table>

*Other names and brands may be claimed as the property of others.

1.4. Acceleration Glossary

Table 3. Acceleration Stack for Intel® Xeon® CPU with FPGAs Glossary

<table>
<thead>
<tr>
<th>Term</th>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel® Acceleration Stack for Intel Xeon® CPU with FPGAs</td>
<td>Acceleration Stack</td>
<td>A collection of software, firmware and tools that provides performance-optimized connectivity between an Intel FPGA and an Intel Xeon processor.</td>
</tr>
<tr>
<td>Intel Programmable Acceleration Card with Intel Arria® 10 GX FPGA</td>
<td>Intel PAC with Intel Arria 10 GX FPGA</td>
<td>PCIe accelerator card with an Intel Arria 10 FPGA. Programmable Acceleration Card is abbreviated PAC. Contains an FPGA Interface Manager (FIM) that pairs with an Intel Xeon processor over PCIe bus.</td>
</tr>
<tr>
<td>Intel Xeon Scalable Platform with Integrated FPGA</td>
<td>Integrated FPGA Platform</td>
<td>Intel Xeon plus FPGA platform with the Intel Xeon and an FPGA in a single package and sharing a coherent view of memory via the Ultra Path Interconnect (UPI).</td>
</tr>
<tr>
<td>OPAE_PLATFORM_ROOT</td>
<td></td>
<td>A Linux shell environment variable set up during the process of installing the OPAE SDK delivered with the Acceleration Stack.</td>
</tr>
</tbody>
</table>
2. Streaming DMA AFU Description

The streaming DMA AFU design example shows how to transfer data between memory and Avalon®-ST sources and sinks. Most commonly, a streaming DMA is used to transfer data from host memory into a hardware accelerator and to stream the results back to host memory without using the local FPGA memory as a temporary buffer. These streams typically operate in parallel and reduce the latency of a hardware accelerator by removing the additional memory copy operations.

The streaming DMA AFU comprises the following submodules:

- Memory Properties Factory (MPF) Basic Building Block (BBB)
- Core Cache Interface (CCI-P) to Avalon-MM Adapter
- Streaming DMA Test System, which includes:
  - Memory-to-Stream (M2S) DMA BBB
  - Stream-to-Memory (S2M) DMA BBB
  - Streaming Pattern Checker and Generator

The streaming DMA AFU design example includes a user space driver as well as a host application that performs data transfer between host memory and the FPGA pattern checker and generator. You can use this design example as a starting point to implement streaming data transfers in your own AFU design by replacing the pattern checker and generator with your hardware accelerator and modifying the host application accordingly.

Both the M2S and S2M DMA BBBs support packetized data, so the streaming data includes the start-of-packet (SOP), end-of-packet (EOP), and empty signals. You can use this packet support to transfer a payload whose size is determined by the hardware. For example, a compression accelerator typically receives a payload of known size, but the length of the compressed result is unknown until the accelerator completes the task. The compression accelerator simply issues a packet to the S2M DMA BBB, and the driver provides the host application with a contiguous buffer that contains the compressed results, together with the buffer length.

Related Information
Avalon Interface Specifications

2.1. Hardware Subsystems

The streaming DMA AFU interfaces with the FPGA Interface Unit (FIU) and two banks of local external SDRAM. The streaming DMA BBBs can address up to 256 TB of memory connected to the FPGA. The streaming DMA design example reduces this memory span to 8 GB, which is split into two memory banks. If you are using the streaming DMA BBBs in a design that targets a platform with a different local memory hierarchy or density, you can adjust the local memory pipeline bridges in the streaming DMA test system using Platform Designer.

You can use the streaming DMA AFU to perform the following data transfers:

- Host memory to FPGA stream
- FPGA stream to host memory
- Local FPGA memory to FPGA stream (1)
- FPGA stream to local FPGA memory (1)

(1) Currently, the driver does not support this feature.

The streaming DMA AFU and the M2S and S2M DMA BBBs are implemented as Platform Designer systems. You can find each of these systems in the following location:

$OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu/hw/rtl/qsys

**Figure 1. High Level System Diagram**

The streaming DMA AFU includes the following modules that connect to the FIU:

• MMIO Decode Logic: detects MMIO read and write transactions and separates them from the CCI-P RX channel 0 on which they arrive. This ensures that MMIO traffic never reaches the MPF BBB and is instead serviced by an independent MMIO command channel.

• MPF: ensures that reads issued by the M2S DMA BBB are returned in the order in which they were issued. The streaming DMA BBBs use the Avalon-MM protocol, which requires read data to return in order.

• CCI-P to Avalon-MM Adapter: translates MMIO accesses to Avalon-MM read and write transactions. This module also receives Avalon-MM read and write transactions from the streaming DMA BBBs and converts them to CCI-P transactions that are issued to the host.

• Streaming DMA Test System: a wrapper around the two streaming DMA BBBs that also includes the pattern checker and generator components. This module exposes Avalon-MM master and slave interfaces that connect to the CCI-P to Avalon-MM adapter.

2.2. Streaming DMA Test System

The streaming DMA test system is a Platform Designer system that connects the streaming DMA BBBs to other IP in the system.

**Figure 2. Streaming DMA Test System Block Diagram**

The streaming DMA test system includes the following modules:
• AFU DFH: stores the 64-bit device feature header (DFH) for the streaming DMA AFU. The host driver scans the hardware, searching for the AFU DFH and the DFHs of the various BBBs, to identify the hardware. The AFU DFH is set up to point to the next DFH at offset 0x100.

• M2S DMA BBB: reads buffers from memory and provides the data as a serial stream on its Avalon-ST source port. In this design example, the streaming data is sent to the pattern checker.

• S2M DMA BBB: accepts a serial stream of data on its Avalon-ST sink port and writes the data to buffers in memory. In this design example, the streaming data is sent from the pattern generator.

• Pattern Checker and Generator: the host programs these components with a pattern. The supplied host software configures each component with a pattern that increments by one for every successive byte.

• Clock Crossing Bridge: added between the streaming DMAs and the local FPGA external memory so that the streaming DMA AFU can operate in the pClk clock domain.

• Pipeline Bridge: added between the M2S DMA BBB and the host read interface of the CCI-P to Avalon-MM adapter to improve the maximum operating frequency (Fmax) of the streaming DMA AFU.

• Write Response Bridge: added between the S2M DMA BBB and the host write interface of the CCI-P to Avalon-MM adapter to improve the maximum operating frequency (Fmax). It also sends write responses from the CCI-P interface to the S2M DMA.
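
The incrementing byte pattern used by the pattern checker and generator can be modeled on the host side with a few lines of Python. This is only a sketch and not the supplied host software: the function names `make_pattern` and `check_pattern` are hypothetical, and the sketch assumes the pattern starts at 0 and wraps after 255, which this guide does not state.

```python
def make_pattern(length, start=0):
    """Return `length` bytes, each one greater than the previous (assumed to wrap mod 256)."""
    return bytes((start + i) & 0xFF for i in range(length))


def check_pattern(data, start=0):
    """Return the index of the first mismatching byte, or -1 if every byte matches."""
    for i, value in enumerate(data):
        if value != (start + i) & 0xFF:
            return i
    return -1


buf = make_pattern(4096)                 # hypothetical 4 KB test buffer
corrupted = bytearray(buf)
corrupted[300] ^= 0x01                   # flip one bit to emulate a transfer error

print(check_pattern(buf))                # -1
print(check_pattern(bytes(corrupted)))   # 300
```

Returning the index of the first mismatch, rather than a plain pass or fail, makes it easier to see where in a transfer the data diverged.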

2.3. Memory-to-Stream DMA BBB

The Memory-to-Stream (M2S) DMA BBB reads data from a buffer stored in memory and converts it into an Avalon-ST source stream. The buffer must be aligned to 64 bytes; the driver guarantees this alignment for buffers in host memory. The M2S DMA BBB is configured to handle transfers of up to 1 GB, but the driver divides large transfers into smaller ones with a maximum size of 2 MB.
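
The way a large transfer breaks down into pieces of at most 2 MB can be illustrated as follows. This is a sketch of the arithmetic only, not the driver's actual code: the name `split_transfer` is hypothetical, and 2 MB is assumed to mean 2 × 1024 × 1024 bytes.

```python
MAX_CHUNK_BYTES = 2 * 1024 * 1024  # assumed meaning of the 2 MB limit quoted above
ALIGNMENT_BYTES = 64               # buffer alignment required by the DMA BBB


def split_transfer(total_bytes, max_chunk=MAX_CHUNK_BYTES):
    """Return (offset, length) pairs covering total_bytes in chunks of at most max_chunk."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


chunks = split_transfer(5 * 1024 * 1024)  # a 5 MB transfer
print(chunks)  # [(0, 2097152), (2097152, 2097152), (4194304, 1048576)]

# Every chunk starts on a 64-byte boundary because the chunk size is a multiple of 64.
print(all(offset % ALIGNMENT_BYTES == 0 for offset, _ in chunks))  # True
```

Because the chunk size is a multiple of 64 bytes, a buffer that starts 64-byte aligned stays aligned at every chunk boundary.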

The M2S DMA BBB streaming interface supports packet generation by exposing the start-of-packet (SOP), end-of-packet (EOP), and empty signals. Your host application can optionally instruct the streaming DMA driver to generate packetized data. If you enable packetized data, the empty signal conveys the number of bytes in the final beat of a transfer that are not valid when the EOP signal is asserted. For example, a DMA transfer of 4100 bytes (with packet support) contains 64 full beats of streaming data (each beat is 64 bytes), with SOP asserted during the first beat, followed by one final partial beat. The empty signal is set to 60 during this last beat, with EOP asserted, because only four of its 64 bytes are valid.
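
The beat count and the value of the empty signal for a packetized transfer follow from simple arithmetic. The sketch below reproduces the 4100-byte example; the function name `packet_shape` is hypothetical and is not part of the driver.

```python
BEAT_BYTES = 64  # width of one streaming beat, as stated in the text


def packet_shape(transfer_bytes, beat_bytes=BEAT_BYTES):
    """Return (total_beats, empty) for a packetized transfer.

    `empty` is the number of unused bytes in the final beat (0 if the beat is full).
    """
    if transfer_bytes <= 0:
        raise ValueError("transfer size must be positive")
    total_beats = -(-transfer_bytes // beat_bytes)  # ceiling division
    valid_in_last = transfer_bytes - (total_beats - 1) * beat_bytes
    return total_beats, beat_bytes - valid_in_last


print(packet_shape(4100))  # (65, 60): 64 full beats plus one beat with 4 valid bytes
print(packet_shape(4096))  # (64, 0): the last beat is completely valid
```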
The components in the M2S DMA BBB Platform Designer system implement the following functions:

- **M2S DMA BBB DFH**: stores the 64-bit device feature header (DFH) for the M2S DMA BBB. The host driver scans the hardware, searching for the AFU DFH and the DFHs of the various BBBs, to identify the hardware. The M2S DMA BBB DFH is set up to point to the next DFH at offset 0x100.

- **mSGDMA Dispatcher**: buffers descriptors sent from the host to the BBB.

- **mSGDMA Read Master**: accepts commands from the dispatcher, reads from memory, and converts the data to an Avalon-ST stream. The data leaving the streaming port can be accompanied by the SOP, EOP, and empty sideband signals. If you require the stream to support transfer sizes that are not multiples of 64 bytes, you must request the driver to send packetized data. Then, if the last beat is not 64 bytes in size, the empty signal informs your downstream hardware about the invalid bytes. Only the last beat can contain invalid bytes; all other beats must be 64 bytes in size, as defined by the Avalon-ST specification.

- **Pipeline Bridge**: added between the mSGDMA read master and host/local FPGA memory to improve the maximum operating frequency (Fmax) of the M2S DMA BBB. If your design does not require the M2S DMA BBB to connect to local FPGA memory, then export that interface and ground all its master inputs. All the mSGDMA dispatcher slave interfaces connect to a pipeline bridge which spans an address range of 0x100.

2.4. Stream-to-Memory DMA BBB

The Stream-to-Memory (S2M) DMA BBB accepts Avalon-ST data and transfers it to a buffer in memory. The buffer must be aligned to 64 bytes; the driver guarantees this alignment for buffers in host memory. The S2M DMA BBB is configured to handle transfers of up to 1 GB, but the driver divides large transfers into smaller ones with a maximum size of 2 MB.

The S2M DMA BBB streaming interface supports receiving packetized data by exposing the SOP, EOP, and empty signals. Your host application instructs the streaming DMA driver to use the packet signaling when it requests a streaming transfer. With packetized data, the hardware accelerator that provides the data determines when the transfer completes. For example, if a data compression engine is connected to the S2M DMA BBB, the host application does not know how much data will stream until the compression operation is complete. Instead of dividing this data into frames, your hardware accelerator simply marks the start and end of the payload by asserting SOP and EOP respectively. The DMA transfers the entire payload to memory, and the DMA driver informs the host application of the payload length when the transfer is complete.

Figure 4. S2M DMA BBB Platform Designer System

The components in the S2M DMA BBB Platform Designer system implement the following functions:
• S2M DMA BBB DFH: stores the 64-bit device feature header (DFH) for the S2M DMA BBB. The host driver scans the hardware, searching for the AFU DFH and the DFHs of the various BBBs, to identify the hardware. The S2M DMA BBB DFH points to the next DFH at offset 0x100.

• mSGDMA Dispatcher: buffers descriptors sent from the host to the BBB. The dispatcher includes a response interface that the host driver reads to determine how much data was transferred when the data is packetized (non-deterministic payload length). This component is included with the design example because it is a slightly modified version of the component that is available in Intel Quartus® Prime Pro Edition.

• mSGDMA Write Master: accepts commands from the dispatcher and writes the data accepted by the Avalon-ST sink interface to memory. The data arriving at the streaming port can be accompanied by the SOP, EOP, and empty sideband signals. This component is included with the design example because it is a slightly modified version of the component that is available in Intel Quartus Prime Pro Edition.

• Pipeline Bridge: added between the mSGDMA write master and local FPGA memory to improve the maximum operating frequency (Fmax) of the S2M DMA BBB. If your design does not require the S2M DMA BBB to connect to local FPGA memory, then export that interface and ground all its master inputs. All the mSGDMA dispatcher slave interfaces connect to a pipeline bridge which spans an address range of 0x100.

• Write Response Bridge: added between the mSGDMA write master and the host write interface of the CCI-P to Avalon-MM adapter to improve the maximum operating frequency (Fmax) of the S2M DMA BBB. It also forwards write responses to the write master. The streaming DMA driver instructs the S2M DMA BBB to wait for all write responses to return before it sends an interrupt to the host, which ensures that there are no write synchronization issues.

3. Memory Map and Address Spaces

The streaming DMA AFU has two memory views:

- DMA view
- Host view

The DMA view supports a 49-bit address space. The lower half of the DMA view maps to the local FPGA memory, and the upper half maps to host memory. Only the streaming DMA BBBs have connectivity to the local FPGA memory; the host cannot access it.
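
The split of the DMA view can be illustrated with a small address decoder. This sketch assumes that "half" means an even split, so that bit 48 of the 49-bit address selects host memory; the name `decode_dma_address` is hypothetical and the hardware may decode addresses differently.

```python
DMA_ADDRESS_BITS = 49
HOST_SELECT_BIT = 1 << (DMA_ADDRESS_BITS - 1)  # 2**48, assumed start of the upper half


def decode_dma_address(addr):
    """Return ('local' or 'host', offset within that half) for a DMA-view address."""
    if not 0 <= addr < (1 << DMA_ADDRESS_BITS):
        raise ValueError("address outside the 49-bit DMA view")
    if addr & HOST_SELECT_BIT:
        return "host", addr & (HOST_SELECT_BIT - 1)
    return "local", addr


print(decode_dma_address(0x1000))                  # ('local', 4096)
print(decode_dma_address(HOST_SELECT_BIT | 0x40))  # ('host', 64)
print(HOST_SELECT_BIT // 2**40)                    # 256, the size of each half in TB
```

Under this assumption each half spans 2^48 bytes, which agrees with the 256 TB of FPGA memory that the streaming DMA BBBs are stated to address.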

The host view includes all the registers accessible through MMIO accesses such as DFH tables, and control/status registers of the various components that are used inside the streaming DMA AFU.

The MMIO registers in both streaming DMA BBBs and the streaming DMA AFU support 32- and 64-bit access. The streaming DMA AFU does not support 512-bit MMIO accesses. The mSGDMA registers inside each streaming DMA BBB must be accessed using 32-bit accesses except for descriptor and response accesses.

3.1. Streaming DMA AFU Memory Map

The streaming DMA register map provides the absolute addresses of all the locations within the unit. These registers are in the host view because only the host can access them.

<table>
<thead>
<tr>
<th>Byte Address</th>
<th>Register Name</th>
<th>Span in Bytes</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x0000</td>
<td>Streaming DMA AFU DFH</td>
<td>0x40</td>
<td>Device feature header for the streaming DMA AFU. This DFH points to 0x100 as the next DFH offset.</td>
</tr>
<tr>
<td>0x0100</td>
<td>M2S DMA BBB</td>
<td>0x100</td>
<td>Memory-to-stream DMA BBB.</td>
</tr>
<tr>
<td>0x0200</td>
<td>S2M DMA BBB</td>
<td>0x100</td>
<td>Stream-to-memory DMA BBB.</td>
</tr>
<tr>
<td>0x0300</td>
<td>NULL DFH</td>
<td>0x40</td>
<td>Null device feature header terminating the DFH linked list.</td>
</tr>
<tr>
<td>0x1000</td>
<td>Pattern Checker Memory Slave</td>
<td>0x1000</td>
<td>Pattern checker memory populated by the host application.</td>
</tr>
<tr>
<td>0x2000</td>
<td>Pattern Generator Memory Slave</td>
<td>0x1000</td>
<td>Pattern generator memory populated by the host application.</td>
</tr>
<tr>
<td>0x3000</td>
<td>Pattern Checker CSR Slave</td>
<td>0x10</td>
<td>Pattern checker control and status registers.</td>
</tr>
<tr>
<td>0x3010</td>
<td>Pattern Generator CSR Slave</td>
<td>0x10</td>
<td>Pattern generator control and status registers.</td>
</tr>
</tbody>
</table>
3.2. Memory-to-Stream DMA BBB Memory Map

The M2S DMA BBB memory map provides the address offsets of all the locations within the BBB. The registers listed in the following table reside at offset 0x100 in the streaming DMA AFU MMIO address space.

* You can adjust the local FPGA memory addressable space in the DMA AFU Platform Designer system. The S2M and M2S DMAs are designed to address up to 256 TB of FPGA memory.

Table 5. Memory-to-Stream DMA BBB Memory Map

<table>
<thead>
<tr>
<th>Byte Address</th>
<th>Register Name</th>
<th>Span in Bytes</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x00</td>
<td>M2S DMA BBB DFH</td>
<td>0x40</td>
<td>Device feature header for the M2S DMA BBB. This DFH points to 0x100 as the next DFH offset.</td>
</tr>
<tr>
<td>0x40</td>
<td>M2S DMA Dispatcher CSR</td>
<td>0x20</td>
<td>Control port for the mSGDMA within the memory-to-stream DMA BBB. The driver accesses this location to control the DMA or query its status.</td>
</tr>
<tr>
<td>0x60</td>
<td>M2S DMA Descriptor</td>
<td>0x20</td>
<td>Descriptor port for the mSGDMA within the memory-to-stream DMA BBB. The driver writes descriptors to this location.</td>
</tr>
</tbody>
</table>

3.3. Stream-to-Memory DMA BBB Memory Map

The S2M DMA BBB memory map provides the address offsets of all the locations within the BBB. The registers listed in the following table reside at offset 0x200 in the streaming DMA AFU MMIO address space.

Table 6. Stream-to-Memory DMA BBB Memory Map

<table>
<thead>
<tr>
<th>Byte Address</th>
<th>Register Name</th>
<th>Span in Bytes</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x00</td>
<td>S2M DMA BBB DFH</td>
<td>0x40</td>
<td>Device feature header for the S2M DMA BBB. This DFH points to 0x100 as the next DFH offset.</td>
</tr>
<tr>
<td>0x40</td>
<td>S2M DMA Dispatcher CSR</td>
<td>0x20</td>
<td>Control port for the mSGDMA within the stream-to-memory DMA BBB. The driver accesses this location to control the DMA or query its status.</td>
</tr>
<tr>
<td>0x60</td>
<td>S2M DMA Descriptor</td>
<td>0x20</td>
<td>Descriptor port for the mSGDMA within the stream-to-memory DMA BBB. The driver writes descriptors to this location.</td>
</tr>
<tr>
<td>0x80</td>
<td>S2M DMA Response</td>
<td>0x8</td>
<td>Response port for the mSGDMA within the stream-to-memory DMA BBB. The driver reads this port to determine how much data was streamed to the memory.</td>
</tr>
</tbody>
</table>
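
Because the BBB memory maps give offsets relative to each BBB, the absolute MMIO address of a register is the BBB base address plus the register offset. The sketch below works this out for the design example. The dictionary keys and the function name `absolute_addresses` are hypothetical labels chosen for illustration and are not identifiers from the driver.

```python
# Base addresses taken from the streaming DMA AFU memory map.
M2S_BASE = 0x100
S2M_BASE = 0x200

# Register offsets taken from the BBB memory maps (names are illustrative only).
COMMON_OFFSETS = {"DFH": 0x00, "DISPATCHER_CSR": 0x40, "DESCRIPTOR": 0x60}
S2M_ONLY_OFFSETS = {"RESPONSE": 0x80}


def absolute_addresses(base, offsets):
    """Return {register name: absolute MMIO byte address} for one BBB."""
    return {name: base + offset for name, offset in offsets.items()}


m2s = absolute_addresses(M2S_BASE, COMMON_OFFSETS)
s2m = absolute_addresses(S2M_BASE, {**COMMON_OFFSETS, **S2M_ONLY_OFFSETS})

for name, addr in m2s.items():
    print(f"M2S {name}: 0x{addr:03X}")
for name, addr in s2m.items():
    print(f"S2M {name}: 0x{addr:03X}")
```

With these values the S2M dispatcher CSR, for example, lands at 0x240 and the S2M response port at 0x280.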

3.4. Device Feature Header Linked-list

The streaming DMA AFU design example contains four device feature headers (DFHs) that form a linked list. This linked list allows the sample application to identify the streaming DMA AFU and the driver to identify each of the streaming DMA BBBs. A NULL DFH is included at the end of the list. Because the list ends with a NULL DFH, you can add more streaming DMA BBBs to your design: you simply move the NULL DFH to an address after the other BBBs. Each streaming DMA BBB expects the next DFH to be located 0x100 bytes from the base address of the BBB. The following figure depicts the linked list for the streaming DMA AFU design example.

If you want two M2S and two S2M DMA BBBs in your design, you can use the following address map to implement four streaming channels. The four streaming DMA BBBs can reside anywhere in the address map as long as they are packed together in the MMIO address space at 0x100-byte intervals. The DFH that follows a streaming DMA BBB must be located at offset 0x100 from that streaming DMA BBB; it can be the NULL DFH or another DFH.

<table>
<thead>
<tr>
<th>Byte Address</th>
<th>Register Name</th>
<th>Span in Bytes</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x000</td>
<td>Streaming DMA AFU DFH</td>
<td>0x40</td>
<td>Your AFU DFH. This DFH points to 0x100 as the next DFH offset.</td>
</tr>
<tr>
<td>0x100</td>
<td>M2S DMA BBB #1</td>
<td>0x100</td>
<td>First memory-to-stream DMA BBB. Next DFH set to 0x100.</td>
</tr>
<tr>
<td>0x200</td>
<td>M2S DMA BBB #2</td>
<td>0x100</td>
<td>Second memory-to-stream DMA BBB. Next DFH set to 0x100.</td>
</tr>
<tr>
<td>0x300</td>
<td>S2M DMA BBB #1</td>
<td>0x100</td>
<td>First stream-to-memory DMA BBB. Next DFH set to 0x100.</td>
</tr>
<tr>
<td>0x400</td>
<td>S2M DMA BBB #2</td>
<td>0x100</td>
<td>Second stream-to-memory DMA BBB. Next DFH set to 0x100.</td>
</tr>
<tr>
<td>0x500</td>
<td>NULL DFH</td>
<td>0x40</td>
<td>Null DFH at the end of the linked list.</td>
</tr>
</tbody>
</table>
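
The address map above can be generated, and the resulting linked list walked, with the following simplified model. It is a sketch only: a real DFH is a 64-bit register with encoded fields, whereas this model stores just a name and a next-DFH offset, and it uses a next offset of 0 to mark the NULL DFH. The function names `build_dfh_map` and `walk_dfh_list` are hypothetical.

```python
def build_dfh_map(bbb_names, afu_base=0x000, stride=0x100):
    """Place the AFU DFH, then one BBB every `stride` bytes, then a NULL DFH."""
    dfh_map = {}
    addr = afu_base
    for name in ["AFU DFH", *bbb_names]:
        dfh_map[addr] = (name, stride)  # each header points `stride` bytes ahead
        addr += stride
    dfh_map[addr] = ("NULL DFH", 0)     # model assumption: offset 0 ends the list
    return dfh_map


def walk_dfh_list(dfh_map, start=0x000):
    """Follow next-DFH offsets from `start` and return [(address, name), ...]."""
    found = []
    addr = start
    while True:
        name, next_offset = dfh_map[addr]
        found.append((addr, name))
        if next_offset == 0:
            return found
        addr += next_offset


four_channels = build_dfh_map(
    ["M2S DMA BBB #1", "M2S DMA BBB #2", "S2M DMA BBB #1", "S2M DMA BBB #2"]
)
for addr, name in walk_dfh_list(four_channels):
    print(f"0x{addr:03X}  {name}")
```

For the four-channel layout the walk visits 0x000 through 0x500 in steps of 0x100 and ends at the NULL DFH, matching the table above.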
4. Software Programming Model

The streaming DMA AFU includes a user space driver that you can use in your own host application. The streaming DMA AFU host application and the user space driver are located in the following directory:

$OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu/sw

All the driver APIs are documented in the fpga_dma.h header file. The user space driver supports both blocking and non-blocking DMA transfers. When you use both streaming DMA BBBs to stream data to and from your accelerator, Intel recommends that you use non-blocking (asynchronous) transfers so that both DMAs can transfer data simultaneously. Using the blocking (synchronous) transfer API to transfer data to and from your accelerator concurrently may lead to deadlock, because each streaming DMA can buffer only approximately 8 KB of data before it applies back pressure.
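
The reason for running both directions concurrently can be illustrated with a software analogy. The sketch below does not use the driver API or model the hardware in detail: a bounded Python queue stands in for the approximately 8 KB of buffering, one thread plays the receive side, and the main thread plays the send side. All names are hypothetical.

```python
import queue
import threading

BEAT_BYTES = 64
BUFFER_BEATS = (8 * 1024) // BEAT_BYTES  # about 8 KB of buffering, modeled as 128 beats


def stream_concurrently(total_beats):
    """Push total_beats through a bounded FIFO while a second thread drains it."""
    fifo = queue.Queue(maxsize=BUFFER_BEATS)
    received = []

    def receiver():
        for _ in range(total_beats):
            received.append(fifo.get())

    rx = threading.Thread(target=receiver)
    rx.start()                    # the receive side is already running (non-blocking style)
    for beat in range(total_beats):
        fifo.put(beat)            # blocks only until the receiver frees a slot
    rx.join()
    return len(received)


# 1024 beats is 64 KB, eight times the modeled buffer depth.
print(stream_concurrently(1024))  # 1024
```

If the sender in this analogy issued all of its `put()` calls before the receiver was started, it would block as soon as the queue filled up and never reach the point of starting the receiver, which mirrors the deadlock risk described above.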
5. Running the AFU Example

Intel recommends that you refer to the Quick Start Guide for your Intel PAC to become familiar with running similar examples. You must set the `OPAE_PLATFORM_ROOT` environment variable before proceeding.

**Note:** Intel recommends you use the GCC (C Compiler) to compile the design example. If you compile the DMA sample application and user space driver with g++ (C++ compiler), it may result in compilation errors.

Follow these steps to download the DMA Accelerator Function (AF) bitstream, build, and run the design example:

1. `sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"`
   If you have not already done so, use the above command to configure the system and allocate twenty 2 MB hugepages for the DMA user space driver (2). This command requires root privileges.
2. `cd $OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu/sw`
3. `make`
4. `sudo fpgaconf ../bin/streaming_dma_afu.gbs`
5. `` sudo LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ./fpga_dma_st_test 0 ``

Sample output from running the DMA test software:

```
Running test in HW mode
No of DMA channels = 00000002
DMA Base Addr = 00000100
DMA Base Addr = 00000200
M2S Checker:Data Verification Success!
M2S Checker:Data Verification Success!
S2M: Data Verification Success!
S2M: Data Verification Success!
Running Bandwidth Tests.. Streaming from host memory to FPGA..
M2S Checker:Data Verification Success!
Measured bandwidth = 6732.154665 Megabytes/sec
Streaming from FPGA to host memory..
Verifying buffer..
S2M: Data Verification Success!
Measured bandwidth = 5434.340969 Megabytes/sec
```

(2) If your host has multiple cards, you need twenty 2 MB hugepages per card. For example, a multi-channel system with four cards requires a total of eighty 2 MB hugepages.

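
The hugepage requirement scales linearly with the number of cards. The short calculation below reproduces the four-card example; the function name `hugepages_needed` is hypothetical, and 2 MB is assumed to mean 2 × 1024 × 1024 bytes.

```python
PAGES_PER_CARD = 20                # from the note above
HUGEPAGE_BYTES = 2 * 1024 * 1024   # assumed size of one 2 MB hugepage


def hugepages_needed(cards):
    """Return the number of 2 MB hugepages to reserve for `cards` cards."""
    return PAGES_PER_CARD * cards


pages = hugepages_needed(4)
print(pages)                            # 80
print(pages * HUGEPAGE_BYTES // 2**20)  # 160 (MB of memory reserved)
```

The first printed value is the number you would write to `nr_hugepages` in step 1 for a four-card host.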
Related Information

Intel FPGA Acceleration Hub: Knowledge Center

Provides more information about the related resources, collateral, and training.
6. Generating the Accelerator Function (AF)

To set up a synthesis build environment for generating an AF, use the afu_synth_setup command as follows:

1. cd $OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu
2. afu_synth_setup --source hw/rtl/filelist.txt build_synth

From the synthesis build directory generated by afu_synth_setup, enter the following commands in a terminal window to generate an AF for the target hardware platform:

3. cd build_synth
4. $OPAE_PLATFORM_ROOT/bin/run.sh

The run.sh AF generation script creates the AF image with the same base filename as the AFU's platform configuration file, with a .gbs suffix, at the following location:

$OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu/build_synth/streaming_dma_afu.gbs

7. Simulating the AFU Example

Intel recommends that you refer to the Quick Start Guide for your Intel PAC to become familiar with simulating similar examples and to set up your environment. You must set the OPAE_PLATFORM_ROOT environment variable before proceeding.

**Note:** Intel recommends you use the GCC (C Compiler) to compile the design example. If you compile the DMA sample application and user space driver with g++ (C++ compiler), it may result in compilation errors.

Complete the following steps to set up the hardware simulator for the streaming DMA AFU:

1. cd $OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu
2. afu_sim_setup --source hw/rtl/filelist.txt build_ase_dir
3. cd build_ase_dir
4. make
5. make sim

**Sample output from the hardware simulator:**

```
[SIM]  ** ATTENTION : BEFORE running the software application **
[SIM]  Set env(ASE_WORKDIR) in terminal where application will run (copy-and-paste) =>
[SIM]  $SHELL | Run:
[SIM]  ---------+---------------------------------------------------
[SIM]  bash/zsh | export ASE_WORKDIR=/mnt/Tools/ias/hw/samples/streaming_dma_afu/build_ase_dir/work
[SIM]  tcsh/csh | setenv ASE_WORKDIR /mnt/Tools/ias/hw/samples/streaming_dma_afu/build_ase_dir/work
[SIM]  For any other $SHELL, consult your Linux administrator
[SIM]  [SIM] Ready for simulation...
[SIM]  Press CTRL-C to close simulator...
```

UG-20171 | 2018.08.06

Complete the following steps to compile and execute the streaming DMA AFU software in the simulation environment:

1. Open a new terminal window.
2. cd $OPAE_PLATFORM_ROOT/hw/samples/streaming_dma_afu/sw
3. Copy the environment setup string (choose the string appropriate for your shell) from the hardware simulator output in the steps above into the terminal window. See the following lines in the sample output from the hardware simulator.

   [SIM]  bash/zsh | export ASE_WORKDIR=/mnt/Tools/ias/hw/samples/streaming_dma_afu/build_ase_dir/work
   [SIM]  tcsh/csh | setenv ASE_WORKDIR /mnt/Tools/ias/hw/samples/streaming_dma_afu/build_ase_dir/work

4. make USE_ASE=1
5. LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ./fpga_dma_st_test 1

Sample output from running software using simulation environment:

   [APP]  Deallocating memory /buf15.894589435998081 ...
   [APP]  SUCCESS
   [APP]  MMIO Write : tid = 0x07f, offset = 0x244, data = 0x00000000
   [APP]  Deinitializing simulation session
   [APP]  Closing Watcher threads
   [APP]  Deallocating UMAS
   [APP]  Deallocating memory /umas.894589435998081 ...
   [APP]  SUCCESS
   [APP]  Deallocating MMIO map
   [APP]  Deallocating memory /mmio.894589435998081 ...
   [APP]  SUCCESS
   [APP]      Took 87,877,947,778 nsec
   [APP]  Session ended

Related Information

Intel FPGA Acceleration Hub: Knowledge Center
Provides more information about the related resources, collateral, and training.

8. Document Revision History for Streaming DMA Accelerator Functional Unit (AFU) User Guide

<table>
<thead>
<tr>
<th>Document Version</th>
<th>Intel Acceleration Stack Version</th>
<th>Changes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2018.08.06</td>
<td>1.1 Production (supported with Intel Quartus Prime Pro Edition 17.1.1)</td>
<td>Initial release.</td>
</tr>
</tbody>
</table>

Related Information

Intel FPGA Acceleration Hub: Knowledge Center

Provides more information about the related resources, collateral, and training.