AN 829: PCI Express Avalon-MM DMA Reference Design
The Avalon® Memory-Mapped (Avalon®-MM) Direct Memory Access (DMA) Reference Design
demonstrates the performance of the Arria® 10, Cyclone® 10 GX, and Stratix® 10 Hard IP
for PCI Express using an Avalon®-MM interface and an embedded, high-performance DMA
controller. The design includes a Linux software driver to set up the DMA transfers.
The read DMA moves data from the system memory to the on-chip or external memory. The write
DMA moves data from the on-chip or external memory to the system memory.
This reference design allows you to evaluate the performance of the
PCIe® protocol using the Avalon®-MM interface with an
embedded, high-performance DMA.
DMA Reference Design Block Diagram
This block diagram shows both the on-chip memory and external memory.
DMA Reference Design Hardware and Software Requirements
The reference design runs on the following development kits:
Arria® 10 GX FPGA Development Kit
Cyclone® 10 GX FPGA Development Kit
Stratix® 10 FPGA Development Kit
The reference design requires two computers:
A computer with a PCIe® slot running Linux. This computer is computer number 1.
A second computer with the
Quartus® Prime software version 18.0 installed. This computer downloads
the FPGA SRAM Object File (.sof) to the
FPGA on the development kit. This computer is computer number 2.
The reference design software installed on computer number 1. The
reference designs are available in the Intel FPGA Design Store. The
Quartus® Prime Pro Edition Platform Archive includes
the recommended synthesis, fitter, and timing analysis settings for the
parameters specified in the reference designs.
The Quartus® Prime software installed on computer number 2. You
can download this software from the
Quartus® Prime Pro Edition Software Features/Download web page.
The Linux driver configured specifically for these reference designs.
The Avalon-MM interface with DMA includes the following modules:
Avalon®-MM DMA Reference Design Block Diagram
This block diagram shows both the on-chip memory and external memory.
Read Data Mover
The Read Data Mover sends memory read TLPs
upstream. After the Read Data Mover receives the Completion, the
Read Data Mover writes
the received data to the on-chip or external memory.
Write Data Mover
The Write Data Mover reads data from the on-chip or external memory and
sends the data
upstream using memory write TLPs on the PCIe® link.
Descriptor Controller
The Descriptor Controller module manages the DMA read and write operations.
Host software programs internal registers in the Descriptor
Controller with the location and size of the descriptor table residing in the host
system memory through the
Avalon®-MM RX master port. Based on this
information, the Descriptor Controller directs the Read Data Mover to copy the
entire table to local FIFOs for execution. The Descriptor Controller sends
completion status upstream via the Avalon®-MM TX
slave (TXS) port.
You can also use your own external descriptor controller to manage
the Read and Write Data Movers. However, you cannot change the interface between
your own external controller and the Read and Write Data Movers embedded in the
TX Slave
The TX Slave module propagates
Avalon®-MM reads and writes upstream.
Avalon®-MM masters, including the
master, can access system memory using the TX Slave. The DMA Controller uses this
path to update the DMA status upstream, using Message Signaled Interrupt (MSI)
RX Master (Internal Port for BAR0 Control)
The RX Master module propagates single dword read and write TLPs from
the Root Port to the
Avalon®-MM domain via a 32-bit
Avalon®-MM master port. Software
instructs the RX Master to send control, status, and descriptor information to
Avalon®-MM slaves, including the DMA
The RX Master port is an internal port that is not visible in Platform Designer.
Working with the Reference Design
The reference design uses the following directory structure:
top: The top-level module.
top_hw: Platform Designer top-level files. If you modify the design using Platform Designer, you must regenerate the system for the changes to take effect.
Parameter Settings for PCI Express Hard IP Variations
This reference design supports a 256-byte maximum payload size. The following
tables list the values for all the parameters.
Stratix® 10 GX DMA Reference Design Platform Designer System
The Stratix® 10 design includes pipeline components and clock-crossing logic that are
not present in the designs for the other devices.
Table 10. Platform Designer Port
This is an Avalon®-MM master port. The PCIe® host accesses the memory through
PCIe® BAR2 for Arria® 10 and Cyclone® 10 GX devices. The host accesses the memory
through PCIe® BAR4 for Stratix® 10 devices. These BARs connect to both
on-chip and external memory.
In a typical application, system software
controls this BAR to initialize random data in the external
memory. Software also reads the data back to verify correct operation.
This is an Avalon®-MM slave port. In a typical application, an Avalon®-MM
master controls this port to send memory reads and writes to
the PCIe® domain.
When the DMA completes operation, the
Descriptor Controller uses this port to write DMA status
back to the descriptor table in the
PCIe® domain. The Descriptor
Controller also uses this port to send MSI interrupts upstream.
Read Data Mover
This is an Avalon®-MM master port. The Read Data
Mover uses this Avalon®-MM master to move data from the
PCIe® domain to either the on-chip or external
memory. The Read Data Mover also uses this port to fetch
descriptors from the
PCIe® domain and write them to the FIFO in the Descriptor Controller.
The design includes separate descriptor
tables for read and write descriptors. Consequently, this port
connects to wr_dts_slave
for the write DMA descriptor FIFO and rd_dts_slave for the read DMA
descriptor FIFO.
Write Data Mover
This is an Avalon®-MM master port. The Write
Data Mover uses this Avalon®-MM
master to read data from either the on-chip or external
memory and then write data to the PCIe® domain.
The external memory controller is a
single-port RAM. Consequently, the Write Data Mover and the
Read Data Mover must share this port to access external memory.
FIFO in Descriptor Controller
These are Avalon®-MM slave ports for the FIFOs
in the Descriptor Controller. When the Read Data Mover
fetches the descriptors from system memory, the Read Data
Mover writes the descriptors to the FIFO using the wr_dts_slave and rd_dts_slave ports.
Control module in Descriptor Controller
The Descriptor Controller control module
includes one transmit and one receive port for the read and
write DMAs. The receive port connects to RXM_BAR0. The
transmit port connects to the txs.
The receive path from the RXM_BAR0 connects
internally. RXM_BAR0 is not shown in the Platform Designer connections panel.
For the transmit path, both read and write DMA ports connect
to the txs externally.
These ports are visible in the Platform Designer connections panel.
Internal connection, not shown
This Avalon®-MM master port passes the memory access from the
PCIe® host to
PCIe® BAR0. The host uses this
port to program the Descriptor Controller. Because this
reference design includes the Descriptor Controller as an
internal module, Platform Designer
does not display this port on the top-level connections panel.
64 KB Dual Port RAM
This is a 64-KB dual-port on-chip memory.
The address range is 0x0800_0000-0x0800_FFFF on the
Avalon®-MM bus. This
address is the source address for write DMAs or destination
address for read DMAs.
To prevent data corruption, software divides
the memory into separate regions for reads and writes. The
regions do not overlap.
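The region split described above can be sketched as follows. This is a minimal illustration, assuming an even half-and-half split; the source states only that the read and write regions do not overlap, so the split point and the function names here are hypothetical.

```python
# Address range of the 64-KB on-chip memory, taken from the text above.
ONCHIP_BASE = 0x0800_0000
ONCHIP_END = 0x0800_FFFF  # inclusive


def split_onchip_memory(base=ONCHIP_BASE, end=ONCHIP_END):
    """Split the inclusive range [base, end] into two equal regions.

    The half-and-half split is an assumption made for illustration only.
    """
    size = end - base + 1
    half = size // 2
    read_region = (base, base + half - 1)   # destination for read DMAs
    write_region = (base + half, end)       # source for write DMAs
    return read_region, write_region


def regions_overlap(a, b):
    """Return True if two inclusive (start, end) ranges share any address."""
    return a[0] <= b[1] and b[0] <= a[1]


read_region, write_region = split_onchip_memory()
print([hex(x) for x in read_region], [hex(x) for x in write_region])
print("overlap:", regions_overlap(read_region, write_region))
```

With this split, each direction gets 32 KB, and the overlap check gives software a cheap guard before it programs any descriptor addresses.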
Intel DDR3 or DDR4 controller
DDR3 or DDR4 Controller
This is a single-port DDR3 or DDR4 controller.
DMA Procedure Steps
Software running on the host completes the following steps to initiate
the DMA and verify the results:
Software allocates system memory for the descriptor table.
Software allocates system memory for the DMA data transfers.
Software writes the descriptors to the descriptor table in the
system memory. The DMA supports up to 128 read and 128 write descriptors. The
descriptor table records the following information:
Descriptor ID, ranging from 0-127
For the read DMA, the software initializes the system memory
space with random data. The Read Data Mover moves this data from the system
memory to either the on-chip or external memory. For the write DMA, the software
initializes the on-chip or external memory with random data. The Write Data
Mover moves the data from the on-chip or external memory to the system memory.
Software programs the registers in the Descriptor Controller's
control module through BAR0. Programming specifies the base address of the
descriptor table in system memory and the base address of the FIFO that stores
the descriptors in the FPGA.
To initiate the DMA, software writes the ID of the last
descriptor to the Descriptor Controller's control logic. The DMA begins fetching
descriptors. The DMA starts with descriptor ID 0 and finishes with the ID of the
last descriptor.
After data transfers for the last descriptor complete, the
Descriptor Controller writes 1'b1 to the Done
bit in the descriptor table entry corresponding to the last descriptor in the
PCIe® domain using the txs port.
Software polls the Done bit in
the descriptor table entry corresponding to the last descriptor. After the DMA
Controller writes the Done bit, software
calculates throughput. Software compares the data in the system
memory to the on-chip or external memory. The test passes if there are no
mismatches.
For simultaneous reads and writes, the software begins the read
DMA operation before the write DMA operation. The DMA completes when all the
read and write DMAs finish.
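The descriptor setup in the steps above can be sketched in host-side terms. This is a simplified model, not the driver's actual data layout: the source confirms only the descriptor ID range (0-127) and the 128-descriptor limit, so the address and size fields, the class, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_DESCRIPTORS = 128  # per direction, from the text above
DWORD_BYTES = 4


@dataclass
class Descriptor:
    desc_id: int     # 0-127, from the text above
    src_addr: int    # assumed field
    dst_addr: int    # assumed field
    num_dwords: int  # assumed field


def build_descriptor_table(src_base, dst_base, num_descriptors, dwords_per_descriptor):
    """Build a table of back-to-back transfers with IDs starting at 0."""
    if not 1 <= num_descriptors <= MAX_DESCRIPTORS:
        raise ValueError("descriptor count must be 1-128")
    stride = dwords_per_descriptor * DWORD_BYTES
    return [
        Descriptor(i, src_base + i * stride, dst_base + i * stride, dwords_per_descriptor)
        for i in range(num_descriptors)
    ]


def last_descriptor_id(table):
    """The ID that software writes to the Descriptor Controller to start the DMA."""
    return table[-1].desc_id


# Hypothetical addresses: a system memory buffer and the on-chip memory base.
table = build_descriptor_table(0x1000_0000, 0x0800_0000, 4, 256)
print(last_descriptor_id(table))
```

Because the DMA always starts at descriptor ID 0, the last ID alone tells the Descriptor Controller how many descriptors to fetch and execute.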
Setting Up the Hardware
Power down computer number 1.
Plug the FPGA Development Kit card into a
PCIe® slot that supports Gen2 x4 or Gen3 x8.
For the Stratix® 10 FPGA
Development Kit, connectors J26 and J27 power the card. After inserting the card
into an available PCIe® slot, connect 2x4- and 2x3-pin
PCIe® power cables from the power supply of computer number 1 to
connectors J26 and J27 of the card.
Connect a USB cable from computer number 2 to the FPGA Development Kit.
The Development Kit includes an on-board
Intel® FPGA Download Cable for FPGA programming.
To power up the FPGA Development Kit via the
PCIe® slot, power on computer number 1. Alternatively, you can power up the FPGA
Development Kit using the external power adapter that ships with the kit.
For the Cyclone® 10 GX FPGA
Development Kit, an on-board programmable oscillator is the clock source for
hardware components. Follow the instructions in Setting Up
Cyclone® 10 GX FPGA Programmable
Oscillator to program this oscillator.
On computer number 2, bring up the
Quartus® Prime programmer and configure the FPGA through an
Intel® FPGA Download Cable.
You must reconfigure the FPGA whenever the FPGA Development Kit loses power.
To force system enumeration to discover the
PCIe® device, restart computer 1.
If you are using the
Stratix® 10 GX FPGA Development Kit, you might get the following error
message during BIOS initialization if the memory mapped I/O is only 4 GB: Insufficient PCI Resources Detected. To work around
this issue, enable Above 4G Decoding in the
BIOS Boot menu.
Programming the Intel Cyclone 10 GX FPGA Oscillator
The Cyclone® 10 GX Development Kit includes a programmable oscillator that
you must set up before you can run the reference design for
Cyclone® 10 GX devices. A ClockController GUI allows
you to import the correct settings.
Installing the DMA Test Driver and Running the Linux DMA Software
In a terminal window on computer 1, change to the DMA driver
directory and extract AN829_driver.tar by
typing the following commands:
% cd <install_dir>/<device>/_PCIe<GenxN>DMA_<QuartusVer>_project/driver
% tar -xvf AN829_driver.tar
To install the Linux driver for the appropriate device family,
type the command:
% sudo ./install <device_family>
Valid values for <device_family> are arria10, cyclone10, and stratix10.
To run the DMA application, type the following command:
% ./run
The application prints the commands available to specify the
DMA traffic. By default, the software enables DMA reads, DMA writes, and
simultaneous DMA reads and writes. The following table lists the available
commands.
Table 11. DMA Test Commands
Command  Operation
1        Start the DMA.
2        Enable or disable read DMA.
3        Enable or disable write DMA.
4        Enable or disable simultaneous read and write DMA.
5        Set the number of dwords per descriptor. The legal range is 256-4096 dwords.
6        Set the number of descriptors. The legal range is 1-127 descriptors.
7        By default, the reference design selects on-chip memory. If you select this
         command, consecutive runs switch between on-chip and external memory.
8        Run the DMA in a continuous loop.
For example, type the following commands to specify 4096
dwords per descriptor and 127 descriptors:
% 5 4096
% 6 127
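As a quick check on these settings, the following sketch validates the two values against the legal ranges listed for commands 5 and 6 and computes the total bytes moved per run. The function name is hypothetical; only the ranges and the 4-byte dword size come from the text or from standard usage.

```python
DWORD_BYTES = 4

DWORDS_RANGE = (256, 4096)    # legal range for command 5, from the table above
DESCRIPTOR_RANGE = (1, 127)   # legal range for command 6, from the table above


def transfer_bytes(dwords_per_descriptor, num_descriptors):
    """Return the total bytes one DMA run moves, after range checks."""
    if not DWORDS_RANGE[0] <= dwords_per_descriptor <= DWORDS_RANGE[1]:
        raise ValueError("dwords per descriptor must be 256-4096")
    if not DESCRIPTOR_RANGE[0] <= num_descriptors <= DESCRIPTOR_RANGE[1]:
        raise ValueError("number of descriptors must be 1-127")
    return dwords_per_descriptor * DWORD_BYTES * num_descriptors


total = transfer_bytes(4096, 127)
print(total)  # 2080768 bytes for the example settings above
```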
The following figures show the throughput for DMA reads,
DMA writes, and simultaneous DMA reads and writes:
Arria® 10 DMA
Cyclone® 10 GX DMA
Stratix® 10 DMA
Understanding PCI Express Throughput
The throughput in a PCI Express system depends on the following factors:
Flow control update latency
Devices forming the link
Protocol overhead includes the following three components:
128b/130b Encoding and
Decoding—Gen3 links use 128b/130b encoding. This encoding adds two
synchronization (sync) bits to each 128-bit data transfer. Consequently, the
encoding and decoding overhead is very small at 1.56%. The effective data rate
of a Gen3 x8 link is about 8 gigabytes per second (GBps).
Data Link Layer Packets
(DLLPs) and Physical Layer Packets (PLPs)—An active link also transmits DLLPs
and PLPs. The PLPs consist of SKP ordered sets which are 16-24 bytes. The DLLPs
are two dwords. The DLLPs implement flow control and the ACK/NAK protocol.
TLP Packet Overhead—The
overhead associated with a single TLP ranges from 5-7 dwords if the optional
ECRC is not included. The overhead includes the following fields:
The Start and End Framing Symbols
The Sequence ID
A 3- or 4-dword TLP header
The Link Cyclic Redundancy Check (LCRC)
of data payload
Figure 11. TLP Packet Format
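The encoding figures quoted above can be reproduced with a short calculation. This is a sketch under one assumption that the text does not state: a Gen3 lane signals at 8 GT/s, which is the PCI Express specification value. The function names are illustrative.

```python
GEN3_GT_PER_S = 8.0  # raw Gen3 rate per lane in GT/s (assumed from the PCIe specification)


def encoding_overhead_percent(data_bits=128, sync_bits=2):
    """Sync bits added per data block, as a percentage of the data bits."""
    return 100.0 * sync_bits / data_bits


def effective_rate_gbytes_per_s(lanes, gt_per_s=GEN3_GT_PER_S):
    """Link data rate in GBps after removing the 128b/130b sync bits."""
    raw_gbits = gt_per_s * lanes
    return raw_gbits * 128 / 130 / 8  # 8 bits per byte


print(encoding_overhead_percent())      # 1.5625
print(effective_rate_gbytes_per_s(8))   # about 7.88 for a Gen3 x8 link
```

The x8 result is slightly under 8 GBps, which is why the text describes the effective data rate as about 8 GBps.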
Throughput for Posted Writes
The theoretical maximum throughput calculation uses the following formula:
Throughput = payload size / (payload size + overhead) * link data rate
Figure 12. Maximum Throughput for Memory Writes
The graph shows the maximum throughput with different TLP header
and payload sizes. The DLLPs and PLPs are excluded from this calculation. For a
256-byte maximum payload size and a 3-dword header, the overhead is five dwords.
Because the interface is 256 bits, the 5-dword overhead requires a single bus cycle.
The 256-byte payload requires 8 bus cycles.
The following equation shows maximum theoretical throughput:
Maximum throughput = 8 cycles / (8 cycles + 1 cycle) * 8 GBps = 7.11 GBps
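The throughput formula and the bus-cycle reasoning above can be compared side by side. This is a minimal sketch: the function names are hypothetical, and the inputs (256-byte payload, 5-dword overhead, 256-bit interface, 8 GBps link data rate) are the values quoted in the text.

```python
def payload_efficiency_throughput(payload_bytes, overhead_bytes, link_rate):
    """Throughput = payload / (payload + overhead) * link data rate."""
    return payload_bytes / (payload_bytes + overhead_bytes) * link_rate


def bus_cycle_throughput(payload_bytes, overhead_bytes, bus_bytes, link_rate):
    """Same ratio, counted in whole bus cycles on the application interface."""
    payload_cycles = -(-payload_bytes // bus_bytes)    # ceiling division
    overhead_cycles = -(-overhead_bytes // bus_bytes)  # ceiling division
    return payload_cycles / (payload_cycles + overhead_cycles) * link_rate


LINK_RATE_GBPS = 8.0    # Gen3 x8, from the text
OVERHEAD_BYTES = 5 * 4  # five dwords
BUS_BYTES = 256 // 8    # 256-bit interface

print(round(payload_efficiency_throughput(256, OVERHEAD_BYTES, LINK_RATE_GBPS), 2))  # 7.42
print(round(bus_cycle_throughput(256, OVERHEAD_BYTES, BUS_BYTES, LINK_RATE_GBPS), 2))  # 7.11
```

The cycle-based figure is lower because the 20-byte overhead occupies a full 32-byte bus cycle.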
The Device Control register, bits [7:5],
specifies the maximum TLP payload size of the current system. The Maximum Payload Size field of the Device
Capabilities register, bits [2:0], specifies the maximum permissible value
for the payload. You specify this read-only parameter, called Maximum
Payload Size, using the parameter editor. After determining the maximum TLP payload for the
current system, software records that value in the Device
Control register. This value must be less than or equal to the maximum payload
specified in the Maximum Payload Size field of the Device Capabilities register.
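The two register fields can be decoded as follows. The bit positions come from the text; the encoding (code 0 is 128 bytes, doubling up to code 5 for 4096 bytes) is the standard PCI Express encoding. The function names and the raw register values in the example are hypothetical.

```python
def decode_payload_size(code):
    """Convert a 3-bit Max Payload Size code to bytes (codes 6 and 7 are reserved)."""
    if not 0 <= code <= 5:
        raise ValueError("reserved Max Payload Size encoding")
    return 128 << code


def device_control_mps(devctl):
    """Payload size currently programmed in Device Control, bits [7:5]."""
    return decode_payload_size((devctl >> 5) & 0x7)


def device_capabilities_mps(devcap):
    """Largest payload the device supports, Device Capabilities bits [2:0]."""
    return decode_payload_size(devcap & 0x7)


def mps_setting_is_legal(devctl, devcap):
    """The programmed value must not exceed the supported value."""
    return device_control_mps(devctl) <= device_capabilities_mps(devcap)


# Hypothetical raw register values: both fields hold code 1 (256 bytes).
print(device_control_mps(0x0020), device_capabilities_mps(0x0001))
print(mps_setting_is_legal(0x0020, 0x0001))
```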
Understanding Flow Control for PCI Express
Flow control guarantees that a TLP is not transmitted unless the
receiver has enough buffer space to accept it.
There are separate credits for headers and payload data. A device needs sufficient
header and payload credits before sending a TLP. When the Application Layer in the
completer accepts the TLP, it frees up the RX buffer space in the completer’s
Transaction Layer. The completer sends a flow control update packet (FC Update DLLP)
to replenish the consumed credits to the transmitter. When
a device consumes all its credits, the rate of FC Update DLLPs to replenish header
and payload credits limits throughput. The flow control updates
depend on the maximum payload size and the latencies of the two connected devices.
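The credit check described above can be modeled with a small transmitter-side tracker. This is a deliberately simplified sketch: real PCI Express hardware tracks cumulative credit counters per TLP type, while this model keeps plain remaining-credit counts for a single type. It assumes the specification values of one header credit per TLP and one data credit per 16 bytes of payload; the class and method names are hypothetical.

```python
DATA_CREDIT_BYTES = 16  # one data credit covers 4 dwords of payload


class CreditGate:
    """Simplified transmitter-side credit tracker for one TLP type."""

    def __init__(self, header_credits, data_credits):
        self.header_credits = header_credits
        self.data_credits = data_credits

    @staticmethod
    def data_credits_needed(payload_bytes):
        # Round up to whole 16-byte credit units.
        return -(-payload_bytes // DATA_CREDIT_BYTES)

    def try_send(self, payload_bytes):
        """Consume credits and return True if the TLP may be sent, else False."""
        needed = self.data_credits_needed(payload_bytes)
        if self.header_credits < 1 or self.data_credits < needed:
            return False
        self.header_credits -= 1
        self.data_credits -= needed
        return True

    def fc_update(self, header_credits, data_credits):
        """Model an FC Update DLLP that returns freed credits."""
        self.header_credits += header_credits
        self.data_credits += data_credits


gate = CreditGate(header_credits=2, data_credits=32)
results = [gate.try_send(256) for _ in range(3)]  # the third send stalls
gate.fc_update(header_credits=1, data_credits=16)
results.append(gate.try_send(256))
print(results)  # [True, True, False, True]
```

The stalled third send is the case the text describes: once credits run out, the transmitter can go no faster than the FC Update DLLPs arrive.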
Throughput for Reads
PCI Express uses a split transaction model for reads. The read transaction
includes the following steps:
The requester sends a Memory Read Request.
The completer sends out the
ACK DLLP to acknowledge the Memory Read Request.
The completer returns a
Completion with Data. The completer can split the Completion into multiple
completion packets.
Read throughput is typically lower than write throughput because reads
require two transactions instead of a single write for the same amount of data. The read
throughput also depends on the round trip delay between the time when the Application
Layer issues a Memory Read Request and the time when the requested data returns. To
maximize the throughput, the application must issue enough outstanding read requests to
cover this delay.
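The number of outstanding read requests needed to cover the round trip delay can be estimated as the bytes the link delivers during that delay, divided by the request size. The sketch below uses this estimate; the 1000 ns latency and 512-byte request size are illustrative assumptions, not measured values from the reference design.

```python
import math


def outstanding_reads_needed(round_trip_ns, link_rate_gbytes_per_s, read_request_bytes):
    """Requests that must be in flight to keep the link busy for one round trip.

    1 GBps equals 1 byte per nanosecond, so latency (ns) times rate (GBps)
    gives the number of bytes that must be in flight.
    """
    bytes_in_flight = round_trip_ns * link_rate_gbytes_per_s
    return math.ceil(bytes_in_flight / read_request_bytes)


# Assumed example: 1000 ns round trip, 8 GBps link, 512-byte read requests.
print(outstanding_reads_needed(1000, 8.0, 512))  # 16
```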
Figure 13. Read Request Timing
The figures below show the timing for Memory Read Requests (MRd)
and Completions with Data (CplD). The first figure shows the requester waiting
for the completion before issuing the subsequent requests.
Waiting results in lower throughput. The second figure shows the requester making
multiple outstanding read requests to eliminate the delay after the first data
returns. Eliminating delays results in higher throughput.
To maintain maximum throughput for the completion data packets, the
requester must optimize the following settings:
The number of completions
in the RX buffer
The rate at which the
Application Layer issues read requests and processes the completion data
Read Request Size
Another factor that affects throughput is the read request size. If a
requester requires 4 KB data, the requester can issue four, 1 KB read requests or a
single 4 KB read request. The 4 KB request results in higher throughput than the
four, 1 KB reads. The
Maximum Read Request Size value in the Device Control register, bits [14:12],
specifies the read request size.
Outstanding Read Requests
A final factor that can affect the throughput is the number of outstanding
read requests. If the requester sends multiple read requests to improve throughput,
the number of available
header tags limits the number of outstanding read requests. The
Arria® 10 and
Cyclone® 10 GX read DMA
can use up to 16 header tags. The
Stratix® 10 read DMA can use up to 32 header tags.
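The tag limit puts an upper bound on read throughput: at most one read request per tag can be in flight during one round trip. The sketch below applies that bound with the 16-tag and 32-tag limits from the text. The 512-byte request size and 2000 ns round trip are illustrative assumptions only.

```python
def tag_limited_throughput(tags, read_request_bytes, round_trip_ns, link_rate_gbytes_per_s):
    """Upper bound on read throughput in GBps given a header tag limit.

    Bytes in flight divided by latency in ns gives bytes per ns, which
    equals GBps. The result cannot exceed the link data rate.
    """
    bytes_in_flight = tags * read_request_bytes
    return min(link_rate_gbytes_per_s, bytes_in_flight / round_trip_ns)


# Assumed example: 512-byte requests, 2000 ns round trip, 8 GBps link.
print(tag_limited_throughput(16, 512, 2000, 8.0))  # 4.096
print(tag_limited_throughput(32, 512, 2000, 8.0))  # 8.0
```

Under these assumed numbers, 16 tags leave the link half idle, while 32 tags are enough to reach the link data rate.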
Understanding Throughput Measurement
To measure throughput, the software driver takes two timestamps. Software
takes the first timestamp shortly after you type the ./run command. Software takes the second timestamp after the DMA completes and
returns the required completion status, EPLAST. If read DMA,
write DMA and simultaneous read and write DMAs are all enabled, the driver takes six
timestamps to make the three measurements.
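The two-timestamp measurement can be sketched as follows. This is a user-space illustration, not the driver's code: `run_dma` is a hypothetical callable standing in for one DMA run, and the sleep in the usage example only simulates a transfer.

```python
import time


def throughput_gbytes_per_s(total_bytes, start_s, end_s):
    """Throughput in GBps (1e9 bytes per second) from two timestamps in seconds."""
    elapsed = end_s - start_s
    if elapsed <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return total_bytes / elapsed / 1e9


def timed_run(run_dma, total_bytes):
    """Take one timestamp before and one after the run, then compute throughput."""
    start = time.perf_counter()
    run_dma()  # hypothetical stand-in for starting the DMA and waiting for completion
    end = time.perf_counter()
    return throughput_gbytes_per_s(total_bytes, start, end)


# Simulated run: pretend 2,080,768 bytes took about 1 ms to move.
print(timed_run(lambda: time.sleep(0.001), 2080768))
```

Because the first timestamp is taken before the DMA starts, the result includes setup time, so short transfers report lower throughput than long ones.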
Throughput Differences for On-Chip and External Memory
This reference design provides a choice between on-chip memory implemented in the FPGA fabric and external memory
available on the PCB. The on-chip memory supports separate read and write ports. Consequently,
this memory supports simultaneous read and write DMAs.
The external memory supports a single port. Consequently, the external memory
does not support simultaneous read DMA and write DMA accesses. In addition, the latency of
external memory is higher than the latency of on-chip memory. These two differences between
the on-chip and external memory result in lower throughput for the external memory.
To compare the throughput for on-chip and external memory, select command 7 for consecutive runs to switch between on-chip and external memory.