Intel Acceleration Stack for Intel® Xeon® CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual

ID 683193
Date 11/04/2019

1.2.3. Memory and Cache Hierarchy

The CCI-P protocol provides a cache hint mechanism that advanced AFU developers can use to tune for performance. This section describes the memory and cache hierarchy for both the Intel® FPGA PAC and the Integrated FPGA Platform. The control mechanisms that CCI-P provides are discussed in the "Intel® FPGA PAC" and "Integrated FPGA Platform" sections below.

Intel® FPGA PAC

Figure 4.  Intel® FPGA PAC Memory Hierarchy
Figure 4 shows the Intel® FPGA PAC memory and cache hierarchy in a single-processor Intel® Xeon® platform. The Intel® FPGA PAC has two memory nodes:
  • Processor Synchronous Dynamic Random Access Memory (SDRAM), referred to as host memory
  • FPGA attached SDRAM, referred to as local memory
The AFU decides whether a request is routed to local memory or to host memory.
Local memory (B.1) occupies a separate address space from host memory (A.2). AFU requests targeted to local memory are always serviced directly by the FPGA-attached SDRAM (denoted (B.1) in Figure 4).
Note: There is no cache along the local memory access path.
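
The routing decision can be pictured as a simple dispatch on the target memory node. The following C sketch is illustrative only; the enum and function names are hypothetical and are not part of the CCI-P definition, which reaches each address space through its own interface.

  #include <stdbool.h>

  /* Hypothetical model of the PAC's two disjoint address spaces.  */
  typedef enum {
      TARGET_HOST_MEMORY,   /* (A.2) processor SDRAM, over PCIe    */
      TARGET_LOCAL_MEMORY   /* (B.1) FPGA-attached SDRAM, uncached */
  } mem_target_t;

  /* The AFU, not the platform, chooses the target. Because host and
   * local memory are separate address spaces, the same numeric
   * address is valid in both, so the target must be picked
   * explicitly for every request.                                  */
  static mem_target_t route_request(bool use_local)
  {
      return use_local ? TARGET_LOCAL_MEMORY : TARGET_HOST_MEMORY;
  }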

AFU requests targeted to host memory travel over PCIe and can be serviced on the processor side, as shown in Figure 4.

For the Last Level Cache (denoted (A.1)):
  • A read request that hits the Last Level Cache has lower latency than a read serviced by the SDRAM (denoted (A.2)).
  • A write request hint can instruct the Last Level Cache how to treat the written data (for example, cacheable, non-cacheable, or with a particular locality).

If a request misses the Last Level Cache, it can be serviced by the SDRAM.

For more information, refer to the WrPush_I request in the CCI-P protocol definition.
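
As a concrete illustration, the following C sketch models a channel-1 (write) request header carrying the WrPush_I hint. The field names, widths, and enum values are illustrative assumptions rendered in C; the normative definitions are the SystemVerilog types in the CCI-P protocol definition.

  #include <stdint.h>

  typedef enum {            /* write request types (illustrative)    */
      REQ_WRLINE_I,         /* write, leave no copy in the FPGA cache */
      REQ_WRLINE_M,         /* write, keep a modified copy cached     */
      REQ_WRPUSH_I          /* write, hint the LLC to cache the data  */
  } c1_req_type_t;

  typedef enum { VC_VA, VC_VL0, VC_VH0, VC_VH1 } vc_sel_t;

  typedef struct {
      vc_sel_t      vc_sel;   /* virtual channel selection            */
      uint8_t       sop;      /* start of a multi-cache-line packet   */
      uint8_t       cl_len;   /* encoded payload length (cache lines) */
      c1_req_type_t req_type;
      uint64_t      address;  /* cache-line-aligned host address      */
      uint16_t      mdata;    /* AFU metadata, echoed in the response */
  } c1_req_hdr_t;

  /* A WrPush_I write asks the Last Level Cache (A.1) to keep a copy
   * of the data, so a subsequent read can hit the cache instead of
   * going out to the SDRAM (A.2).                                    */
  static const c1_req_hdr_t wrpush_example = {
      .vc_sel   = VC_VA,
      .sop      = 1,
      .cl_len   = 0,          /* single cache line                    */
      .req_type = REQ_WRPUSH_I,
      .address  = 0,          /* placeholder address                  */
      .mdata    = 0x42
  };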

Integrated FPGA Platform

Figure 5.  Integrated FPGA Platform Memory Hierarchy
Figure 5 shows the three-level cache and memory hierarchy seen by an AFU in an Integrated FPGA Platform with one Intel® Xeon® processor. A single-processor Integrated FPGA Platform has only one memory node, the processor-side SDRAM (denoted (A.3)). The Intel UPI coherent link extends the Intel® Xeon® processor’s coherency domain to the FPGA, as shown by the green dotted line in Figure 5. A UPI caching agent keeps the FPGA cache in the FIU coherent with the rest of CPU memory. An upstream AFU request targeted to CPU memory can be serviced by:
  • FPGA Cache (A.1)—The Intel UPI coherent link extends the Intel® Xeon® processor’s coherency domain to the FPGA cache. Requests that hit in the FPGA cache have the lowest latency and the highest bandwidth. AFU requests that use the VL0 virtual channel, and VA requests that are steered to the UPI path, look up the FPGA cache first; only upon a miss are they sent off chip to the processor.
  • Processor-side cache (A.2)—A read request that hits the processor-side cache has higher latency than one serviced by the FPGA cache, but lower latency than a read from the processor SDRAM. A write request hint can be used to direct the write to the processor-side cache. For more information, refer to the WrPush_I request in the CCI-P protocol definition.
  • Processor SDRAM (A.3)—A request that misses the processor-side cache is serviced by the SDRAM.

Data access latency increases from (A.1) to (A.3).
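
To make the hierarchy concrete, the following C sketch models a channel-0 (read) request issued on the VL0 virtual channel. As above, the names, widths, and encodings are illustrative assumptions, not the normative CCI-P types.

  #include <stdint.h>

  typedef enum { VC_VA, VC_VL0, VC_VH0, VC_VH1 } vc_sel_t;

  typedef enum {            /* read request types (illustrative)      */
      REQ_RDLINE_I,         /* read, do not cache in the FPGA cache   */
      REQ_RDLINE_S          /* read, cache the line in shared state   */
  } c0_req_type_t;

  typedef struct {
      vc_sel_t      vc_sel;
      uint8_t       cl_len;   /* encoded read length (cache lines)    */
      c0_req_type_t req_type;
      uint64_t      address;  /* cache-line-aligned host address      */
      uint16_t      mdata;    /* AFU metadata, echoed in the response */
  } c0_req_hdr_t;

  /* A VL0 read looks up the FPGA cache (A.1) first; only on a miss
   * does it travel over UPI to the processor-side cache (A.2) and,
   * on a further miss, to the processor SDRAM (A.3). Each step adds
   * latency.                                                         */
  static const c0_req_hdr_t vl0_read = {
      .vc_sel   = VC_VL0,
      .cl_len   = 0,            /* single cache line                  */
      .req_type = REQ_RDLINE_S, /* keep a shared copy in FPGA cache   */
      .address  = 0,            /* placeholder address                */
      .mdata    = 0x7
  };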

Note: Most AFUs achieve maximum memory bandwidth by choosing the VA virtual channel rather than explicitly selecting VL0, VH0, or VH1. The VC steering logic implemented in the FIU has been tuned for the platform: it takes into account physical link latency and efficiency characteristics, physical link utilization, and traffic distribution to provide maximum bandwidth.

One limitation of the VC steering logic is that it does not factor cache locality into the steering decision, because that decision is made before the cache lookup. This means a request can be steered to VH0 or VH1 even though the cache line resides in the FPGA cache. Such a request may incur an additional latency penalty, because the processor may have to snoop the FPGA cache to complete the request. If the AFU knows the locality of its accesses, it may be beneficial to use the VL0 virtual channel to exploit that locality, as the sketch below illustrates.
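
A minimal sketch of this guidance, assuming a hypothetical helper (choose_vc is not a CCI-P or OPAE API): default to VA and let the FIU steer, but select VL0 explicitly when the AFU expects the line to be resident in the FPGA cache.

  typedef enum { VC_VA, VC_VL0, VC_VH0, VC_VH1 } vc_sel_t;

  /* Hypothetical policy helper, not part of any CCI-P or OPAE API.
   * Steering happens before the cache lookup, so VA traffic can be
   * sent to VH0/VH1 even when the line sits in the FPGA cache, then
   * pay a snoop penalty; locality-aware AFUs pick VL0 instead.      */
  static vc_sel_t choose_vc(int expect_fpga_cache_hit)
  {
      return expect_fpga_cache_hit ? VC_VL0 : VC_VA;
  }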