Intel Acceleration Stack for Intel® Xeon® CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual

ID 683193
Date 11/04/2019
Public

1.3.13.1.1. Memory Write Fence

CCI-P WrFence Request

CCI-P defines a WrFence request type, which can be used with all VCs, including VA. The FIU implementation of WrFence stalls the C1 channel and hence blocks all write streams sharing the CCI-P write path. Furthermore, a WrFence request guarantees global observability, which means that for PCIe paths the FIU generates a Zero Length Read (ZLR) to push out the writes. As a result, WrFence requests can incur long stalls on the C1 channel. To avoid this, restrict their use to synchronization points in your AFU's data flow.
  • WrFence guarantees that all interrupts or writes preceding the fence are committed to memory before any writes following the Write Fence are processed.
  • A WrFence is not re-ordered with other memory writes, interrupts, or WrFence requests.
  • WrFence provides no ordering assurances with respect to Read requests.
  • A WrFence does NOT block reads that follow it. In other words, memory reads can bypass a WrFence. This rule is described in the "Memory Requests" section.
  • A WrFence request has a vc_sel field that determines which virtual channels the fence applies to. The fence only serializes writes on the selected channels. For example, if you move a data block using VL0, the fence need only serialize against other write requests on VL0, so issue the WrFence with vc_sel set to VL0. Similarly, if memory writes use VA, issue the WrFence with VA.
  • A WrFence request returns a response, delivered to the AFU over RX C1 and identified by the resp_type field. Because reads can bypass a WrFence, to ensure the latest data is read in a write-followed-by-read sequence (RaW hazard), issue a WrFence and wait for the WrFence response before issuing the read to the same location.
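The RaW-hazard rule above can be sketched as a small software model. This is a hypothetical Python illustration, not CCI-P RTL; the ModelAfu class and its method names are invented for this sketch.

```python
class ModelAfu:
    """Toy model of CCI-P write ordering: writes become globally observable
    only once a WrFence response returns; reads bypass pending fences and
    see only globally observable data. (Hypothetical illustration.)"""

    def __init__(self):
        self.mem = {}       # globally observable memory
        self.pending = []   # writes issued but not yet fenced
        self.fence_id = 0

    def write(self, addr, data, vc):
        self.pending.append((addr, data))   # C1 write request in flight

    def wr_fence(self, vc_sel):
        self.fence_id += 1
        return self.fence_id                # handle to wait on

    def wait_response(self, fence_id):
        # WrFence response on RX C1: all earlier writes are now committed.
        for addr, data in self.pending:
            self.mem[addr] = data
        self.pending.clear()

    def read(self, addr, vc):
        return self.mem.get(addr)           # reads bypass the fence


def raw_safe_read(afu, addr, data, vc):
    """Write, fence, wait for the fence response, then read."""
    afu.write(addr, data, vc=vc)
    fence = afu.wr_fence(vc_sel=vc)
    afu.wait_response(fence)    # required: reads can bypass a WrFence
    return afu.read(addr, vc=vc)
```

In this model, skipping the wait_response call makes the read return stale data, mirroring how a real read that bypasses the fence may miss the preceding write.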

Write Response Counting

An AFU can implement a memory write barrier by waiting for all outstanding writes to complete before sending the first write after the barrier. The logic that tracks outstanding writes can be a simple counter that increments on each request and decrements on each response, hence the name "write response counting". Write responses only guarantee local observability, so this technique works only for a write stream targeted at a single VC (for example: VL0, VH0, VH1). Do not use this technique if a write stream uses VA or a mix of VCs.
Note: For write streams using VA or a mix of VCs, implement a write fence instead.
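The counter described above can be sketched as follows. This is a minimal Python model with hypothetical names; a real AFU would implement it as a small up/down counter in RTL.

```python
class WriteResponseCounter:
    """Counts outstanding writes on a single VC. The barrier is complete
    when the counter drains to zero. (Illustrative sketch.)"""

    def __init__(self):
        self.outstanding = 0

    def on_write_request(self):
        self.outstanding += 1    # TX C1 write request sent

    def on_write_response(self):
        self.outstanding -= 1    # RX C1 write response received

    def barrier_done(self):
        # Safe to issue post-barrier writes once all responses are back.
        return self.outstanding == 0
```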
A key advantage of this technique is that the AFU can implement fine-grained barriers. For example, if the AFU has two independent write streams, it can implement a write-response tracker per stream. If write stream 1 needs a memory barrier, the AFU stalls only the writes from stream 1 while continuing to send writes from stream 2. The Mdata field can be used to encode the stream ID. Such a fine-grained memory barrier may:
  • Minimize the latency cost of the barrier because it would only wait on specific outstanding writes to complete, instead of all of them.
  • Improve link utilization because unrelated write streams can continue to make forward progress.
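The per-stream variant might look like the following Python sketch. Carrying the stream ID in the low bits of Mdata is an assumed encoding chosen for illustration; the tracker classes and names are hypothetical.

```python
class PerStreamWriteTracker:
    """One outstanding-write counter per stream. The stream ID is assumed
    to be encoded in the low bits of Mdata. (Illustrative sketch.)"""

    STREAM_MASK = 0x3   # assumed encoding: low 2 Mdata bits = stream ID

    def __init__(self, num_streams=4):
        self.outstanding = [0] * num_streams

    def on_write_request(self, mdata):
        self.outstanding[mdata & self.STREAM_MASK] += 1

    def on_write_response(self, mdata):
        # The response echoes the request's Mdata, so the stream is
        # recoverable from the response.
        self.outstanding[mdata & self.STREAM_MASK] -= 1

    def barrier_done(self, stream_id):
        # Only this stream waits; other streams continue unaffected.
        return self.outstanding[stream_id] == 0
```

A barrier on stream 1 then polls `barrier_done(1)` while requests tagged for other streams keep flowing, which is the source of the latency and link-utilization benefits listed above.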