Intel Acceleration Stack for Intel® Xeon® CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual

ID 683193
Date 11/04/2019
Public

1.3.4. UMsg

Attention: UMsg is only supported in the Integrated FPGA Platform.

UMsg provides the same functionality as a spin loop in the AFU without consuming CCI-P read bandwidth. Think of it as a spin-loop optimization: a monitoring agent inside the FPGA cache controller watches for snoops to cache lines allocated by the driver. When it sees a snoop to such a cache line, it reads the data back and sends a UMsg to the AFU.

The UMsg flow uses the cache coherency protocol to implement a high-speed, unordered messaging path from CPU to AFU. This process consists of two stages, as shown in Figure 8.

The first stage is initialization: software pins the UMsg Address Space (UMAS) and shares the UMAS start address with the FPGA cache controller. The FPGA cache controller then reads each cache line in the UMAS and holds it in the shared state in the FPGA cache.

The second stage is actual usage, where the CPU writes to the UMAS. A CPU write to the UMAS generates a snoop to the FPGA cache; the FPGA responds to the snoop and marks the line as invalid. The CPU write request then completes, and the data becomes globally visible. A snoop in the UMAS address range triggers the Monitoring Agent (MA), which sends a read request to the CPU for the cache line (CL) and optionally sends a UMsg with Hint (UMsgH) to the AFU. When the read request completes, a UMsg carrying the 64B cache line data is sent to the AFU.
Figure 8. UMsg Initialization and Usage Flow

Functionally, UMsg is equivalent to a spin loop or to the MONITOR and MWAIT instructions on an Intel Xeon processor.

Key characteristics of UMsgs:
  • Just as spin loops to different addresses in a multi-threaded application have no relative ordering guarantee, UMsgs to different addresses have no ordering guarantee between them.
  • Not every CPU write to a UMAS CL results in a corresponding UMsg. The AFU may miss an intermediate change in the value of a CL, but it is guaranteed to see the newest data in the CL. Again, the spin-loop analogy helps: if the producer thread updates the flag CL multiple times, the polling thread may miss an intermediate value, but it is guaranteed to see the newest one.
Below is an example usage: software updates a descriptor queue pointer that may be mapped to a UMsg. The pointer is always expected to increment. The UMsg guarantees that the AFU sees the final value of the pointer; it may miss intermediate updates to the pointer, which is acceptable.
  1. The UMsg uses the FPGA cache, so it can cause cache pollution: data is loaded into the cache solely for UMsg monitoring and may evict other needed data, degrading performance.
  2. Because the CPU may exhibit false snooping, UMsgH should be treated as a hint. That is, you can start a speculative execution or pre-fetch based on UMsgH, but you should wait for UMsg before committing the results.
  3. The UMsg provides the same latency as AFU read polling using RdLine_S, but it saves CCI-P channel bandwidth, which is then available for other read traffic.