AN 690: PCI Express* Avalon®-MM DMA Reference Design

ID 683824
Date 5/08/2017
Public

1.6.5. Peak Throughput

Signal Tap calculates the following throughputs for the Gen3 x8, 256-bit DMA running at 250 MHz for 142 cycles:

  • 7.1 GBps for TX Memory Writes only, 256-byte payload, with the cycles required for transmitting read requests ignored in calculation
  • 6.8 GBps for TX Memory Writes, 256-byte payload, with the cycles required for transmitting read requests included in calculation
  • 7.0 GBps for RX Read Completions

In the Signal Tap run, the TxStValid signal never deasserts. The Avalon-MM application drives TLPs at the maximum rate with no idle cycles. In addition, the Hard IP for PCI Express continuously accepts the TX data. The Hard IP for PCI Express IP Core rarely deasserts the RxStValid signal. The deassertions of the RxStValid signal occur because the host system does not return completion data fast enough. However, the Hard IP for PCI Express IP Core and the Avalon-MM application logic are able to handle the completions continuously, as evidenced by the long stretches of back-to-back TLPs.

The average throughput for 2 MB transfers, including the descriptor overhead, is 6.4 gigabytes per second (GBps) in each direction, for a total bandwidth of 12.8 GBps. The reported DMA throughput reflects typical applications. It includes the following overhead:

  • The time required to fetch DMA Descriptors at the beginning of the operation
  • Periodic suboptimal allocation of TLP address boundaries by the host PC
  • Storing DMA status back to host memory
  • Credit control updates from the host
  • Infrequent Avalon-MM core fabric stalls when the waitrequest signal is asserted

TX Theoretical Bandwidth Calculation

For the TX interface, the write header is 15 bytes. The read header is 7 bytes. Because the TX interface is 256 bits (32 bytes) wide, transferring either a read or a write header requires 1 cycle. Transferring 256 bytes of payload data requires 8 cycles (256/32), for a total of 9 cycles per TLP. The following equation calculates the theoretical maximum TX bandwidth:

8 data cycles/9 cycles x 8 GBps = 7.111 GBps

Consequently, the header cycle reduces the maximum theoretical bandwidth by approximately 11% (1 of every 9 cycles).
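The calculation above can be reproduced with a short Python sketch. The function name and its parameters are illustrative and are not part of the reference design. The sketch assumes one header cycle per TLP and derives the raw 8 GBps figure from a 256-bit interface clocked at 250 MHz.

```python
def theoretical_bandwidth_gbps(payload_bytes=256, bus_width_bits=256,
                               clock_mhz=250, header_cycles=1):
    """Maximum bandwidth in GBps when each TLP costs header_cycles of overhead."""
    bytes_per_cycle = bus_width_bits // 8            # 32 bytes per cycle
    raw_gbps = bytes_per_cycle * clock_mhz / 1000    # 32 B x 250 MHz = 8.0 GBps
    # Ceiling division: a partially filled cycle still occupies the interface.
    data_cycles = -(-payload_bytes // bytes_per_cycle)
    return data_cycles / (data_cycles + header_cycles) * raw_gbps


print(round(theoretical_bandwidth_gbps(), 3))  # 7.111
```

Changing payload_bytes shows how smaller payloads increase the relative cost of the header cycle.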

Signal Tap Observation

Signal Tap simulations show a peak throughput of 142 cycles at 250 MHz to transfer 2 MB of data. Disregarding the time required for Read Requests in the overhead calculation results in the following equation:

((142 - 7 - 15)/(142 - 7)) x 8 GBps = 7.111 GBps

Including the time required for Read Requests in the overhead calculation results in the following equation:

((142 - 7 - 15)/142) x 8 GBps = 6.76 GBps
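The two observed TX figures can be checked with the following Python sketch. The constant and function names are illustrative. Reading the equations above, the sketch assumes that 7 of the 142 captured cycles carry Read Requests and 15 carry Memory Write headers; the remaining 120 cycles carry payload.

```python
TOTAL_CYCLES = 142          # length of the captured window
READ_REQUEST_CYCLES = 7     # assumed: cycles spent sending Read Requests
WRITE_HEADER_CYCLES = 15    # assumed: cycles spent on Memory Write headers
RAW_GBPS = 8.0              # 32 bytes per cycle at 250 MHz


def tx_observed_gbps(include_read_requests):
    """Observed TX bandwidth, optionally counting Read Request cycles as overhead."""
    data_cycles = TOTAL_CYCLES - READ_REQUEST_CYCLES - WRITE_HEADER_CYCLES
    if include_read_requests:
        window = TOTAL_CYCLES
    else:
        window = TOTAL_CYCLES - READ_REQUEST_CYCLES
    return data_cycles / window * RAW_GBPS


print(round(tx_observed_gbps(include_read_requests=False), 3))  # 7.111
print(round(tx_observed_gbps(include_read_requests=True), 2))   # 6.76
```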

RX Theoretical Bandwidth Calculation

For the RX interface, for a 2 MB transfer, the RxStValid signal is not asserted for 3 cycles. The transfer requires 15 Read Completion headers for 256-byte completions. For each Read Completion, the maximum RX bandwidth is represented by the following equation:

8 data cycles/9 cycles x 8 GBps = 7.111 GBps

Signal Tap Observation

Including the number of Completion headers and cycles when the RxStValid signal is not asserted as overhead, the actual observed bandwidth is represented by the following equation:

((142 - 15 - 3)/142) x 8 GBps = 6.986 GBps
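A minimal Python sketch of the observed RX calculation follows. The function name and parameter names are illustrative; the values are the cycle counts quoted in this section, with each Completion header assumed to occupy one cycle.

```python
def rx_observed_gbps(total_cycles=142, completion_header_cycles=15,
                     idle_cycles=3, raw_gbps=8.0):
    """Observed RX bandwidth; header and RxStValid-low cycles count as overhead."""
    data_cycles = total_cycles - completion_header_cycles - idle_cycles
    return data_cycles / total_cycles * raw_gbps


print(round(rx_observed_gbps(), 3))  # 6.986
```

Setting idle_cycles to 0 gives the bandwidth that would be observed if the host returned completion data without gaps.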