Intel® High Level Synthesis Compiler Pro Edition: Reference Manual

ID 683349
Date 6/20/2022
Public


4.4.3.1. Load-Store Unit Types

The Intel® HLS Compiler determines which types of load-store units (LSUs) to instantiate, and whether to coalesce memory accesses, based on the memory access pattern that the compiler infers.

The Intel® HLS Compiler instantiates the following types of LSUs:
Burst-coalesced LSUs
Nonaligned burst-coalesced LSUs
The Intel® HLS Compiler typically instantiates burst-coalesced LSUs for accessing variable-latency Avalon® MM Host interfaces.
Pipelined LSUs
Never-stall pipelined LSUs
The Intel® HLS Compiler typically instantiates pipelined LSUs for accessing fixed-latency Avalon® MM Host interfaces or on-chip memories.

Click LSUs in the System Viewer (in the High-Level Design Reports) to see which types of LSU the compiler instantiated for your component.

Figure 4. Example of LSU Information Provided in the System Viewer


Burst-Coalesced Load-Store Units

By default, the compiler infers burst-coalesced load-store units (LSUs) for any variable-latency Avalon® MM Host interface.

A burst-coalesced LSU dynamically buffers contiguous memory requests until the largest possible burst can be made or until the LSU receives no new requests for a given period of time. The largest possible burst is defined by the ihc::maxburst parameter. For noncontiguous memory requests, a burst-coalesced LSU flushes the buffer between requests.

Burst-coalesced LSUs provide efficient, variable-latency access to memories outside of your component. However, they require a considerable amount of FPGA resources.
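The buffering rule described above can be approximated in plain C++. The following sketch is a software model only, not compiler output and not part of the HLS API: the function name coalesce_bursts and its parameters are hypothetical, addresses are treated as word indices, and the idle-timeout flush is not modeled. It illustrates how a contiguous run of requests is cut into bursts of at most max_burst words and how a noncontiguous request flushes the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of the coalescing rule (assumption: addresses
// are word indices, and the "no new requests for a while" flush is ignored).
// Returns the length, in words, of each burst that would be issued.
std::vector<std::size_t> coalesce_bursts(const std::vector<std::size_t>& word_addrs,
                                         std::size_t max_burst) {
  assert(max_burst > 0 && "a burst must hold at least one word");
  std::vector<std::size_t> burst_lengths;
  std::size_t buffered = 0;  // words currently held in the buffer
  std::size_t prev = 0;      // previous address; meaningful only if buffered > 0
  for (std::size_t addr : word_addrs) {
    const bool contiguous = buffered > 0 && addr == prev + 1;
    // Flush when the run is broken or the largest possible burst is reached.
    if (buffered > 0 && (!contiguous || buffered == max_burst)) {
      burst_lengths.push_back(buffered);
      buffered = 0;
    }
    ++buffered;
    prev = addr;
  }
  if (buffered > 0) burst_lengths.push_back(buffered);  // final flush
  return burst_lengths;
}
```

Under this model, six contiguous word addresses with a maximum burst of 4 produce bursts of 4 and 2 words, and the address sequence 0, 1, 7, 8, 9 produces bursts of 2 and 3 words.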

The following code example results in the Intel® HLS Compiler instantiating two burst-coalesced LSUs by default (because of the variable-latency Avalon® MM Host interface):
#include "HLS/hls.h"

component void
burst_coalesced(ihc::mm_host<int, ihc::dwidth<64>, ihc::awidth<32>,
                             ihc::aspace<1>, ihc::latency<0>> &in,
                ihc::mm_host<int, ihc::dwidth<64>, ihc::awidth<32>,
                             ihc::aspace<2>, ihc::latency<0>> &out,
                int i) {
  int value = in[i / 2]; // Burst-coalesced LSU
  out[i] = value; // Burst-coalesced LSU
}

Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU to be a nonaligned burst-coalesced LSU.

Nonaligned Burst-coalesced LSUs

When a burst-coalesced LSU can issue memory accesses that are not aligned to the external memory word size, the Intel® HLS Compiler creates a nonaligned burst-coalesced LSU. Nonaligned LSUs typically require more FPGA resources to implement than aligned LSUs. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.

The following code example results in two nonaligned burst-coalesced LSUs:
#include "HLS/hls.h"

struct State {
  int x;
  int y;
  int z;
};

component void
static_coalescing(ihc::mm_host<State, ihc::dwidth<128>, ihc::awidth<32>,
                               ihc::aspace<1>, ihc::latency<0>> &in,
                  ihc::mm_host<State, ihc::dwidth<128>, ihc::awidth<32>,
                               ihc::aspace<2>, ihc::latency<0>> &out,
                  int i) {
  out[i] = in[i]; // Two nonaligned burst-coalesced LSUs
}

The figure that follows (Nonaligned Memory Accesses) shows the external memory contents for the previous code example and the nonaligned burst-coalesced LSUs in the component pipeline.

The data type that is read and written is a 96-bit-wide struct. The external memory width is 128 bits. This difference between the read/write data width and the external memory width forces some of the memory requests to span two consecutive memory words.

A nonaligned burst-coalesced LSU can detect that discrepancy and serve such memory requests as needed while still buffering contiguous requests until the largest possible burst can be made.
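The effect of the width mismatch can be checked with a little arithmetic. The sketch below is plain C++, not HLS code, and the helper name words_touched is hypothetical. It assumes the array starts at a word-aligned byte address 0 and that the struct is packed into 12 bytes (three 4-byte int members, no padding), then counts how many 16-byte (128-bit) memory words a single element overlaps.

```cpp
#include <cassert>
#include <cstddef>

// Number of memory words overlapped by element `index` of an array whose
// elements are `elem_bytes` wide and stored back to back starting at byte 0.
// Assumption: the array base is aligned to a memory word boundary.
std::size_t words_touched(std::size_t index, std::size_t elem_bytes,
                          std::size_t word_bytes) {
  assert(elem_bytes > 0 && word_bytes > 0);
  const std::size_t first_byte = index * elem_bytes;
  const std::size_t last_byte = first_byte + elem_bytes - 1;
  // Words are numbered by integer division of the byte address.
  return last_byte / word_bytes - first_byte / word_bytes + 1;
}
```

Under these assumptions, calling words_touched(i, 12, 16) shows that elements 1 and 2 of every group of four each span two consecutive memory words, while elements 0 and 3 fit within a single word.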
Figure 5. Nonaligned Memory Accesses


Pipelined Load-Store Units

By default, the compiler infers pipelined load-store units (LSUs) for any fixed-latency Avalon® MM Host interface and for on-device memories.

In a pipelined LSU, requests are submitted when they are received and no buffering occurs. Pipelined LSUs are also used for accessing memories inside your component.

You can tell the compiler to instantiate pipelined LSUs for variable-latency MM Host interfaces. However, variable-latency interface access with pipelined LSUs might reduce throughput because pipelined LSUs do not combine sequential memory requests into bursts.

Memory accesses are pipelined, so multiple requests can be in flight at the same time.
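To see why having several requests in flight matters, consider a deliberately simplified cycle count. The model below is an illustration only, not a description of the generated hardware: it assumes one request can be issued per clock cycle, every response returns after a fixed latency of at least one cycle, and nothing ever stalls. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycles until the last response arrives when request k is issued in cycle k
// and completes `latency` cycles later (requests overlap in flight).
std::uint64_t pipelined_cycles(std::uint64_t requests, std::uint64_t latency) {
  assert(latency >= 1 && "model assumes a fixed latency of at least one cycle");
  if (requests == 0) return 0;
  return (requests - 1) + latency;
}

// Cycles needed if each request waited for the previous response before
// issuing (no overlap), shown only for comparison.
std::uint64_t serialized_cycles(std::uint64_t requests, std::uint64_t latency) {
  assert(latency >= 1 && "model assumes a fixed latency of at least one cycle");
  return requests * latency;
}
```

In this toy model, 1024 requests at a fixed latency of 2 cycles complete in 1025 cycles when pipelined, compared with 2048 cycles when fully serialized.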

The following code example results in the Intel® HLS Compiler instantiating four pipelined LSUs:
#include "HLS/hls.h"

component void 
pipelined(ihc::mm_host<int, ihc::dwidth<64>, ihc::awidth<32>,
                            ihc::aspace<1>, ihc::latency<2>> &in,
          ihc::mm_host<int, ihc::dwidth<64>, ihc::awidth<32>,
                            ihc::aspace<1>, ihc::latency<2>> &out,
          int gi, int li) {
  int lmem[1024];

  int res = in[gi]; // Pipelined LSU
  for (int i = 0; i < 4; i++) {
    lmem[li - i] = res; // Pipelined LSU
    res >>= 1;
  }

  res = 0;
  for (int i = 0; i < 4; i++) {
    res ^= lmem[li - i]; // Pipelined LSU
  }

  out[gi] = res; // Pipelined LSU
}

Never-Stall Pipelined LSUs

If a pipelined LSU is connected to a memory inside the component or to a fixed-latency MM Host interface without arbitration, the compiler creates a never-stall LSU because every access to the memory takes a fixed number of cycles that is known to the compiler.

The following code example results in the Intel® HLS Compiler instantiating three never-stall pipelined LSUs for accessing the array lmem:
#include "HLS/hls.h"

component void
neverstall(ihc::mm_host<int, ihc::dwidth<128>, ihc::awidth<32>,
                        ihc::aspace<1>, ihc::latency<0>> &in,
           ihc::mm_host<int, ihc::dwidth<128>, ihc::awidth<32>,
                        ihc::aspace<1>, ihc::latency<0>> &out,
           int gi, int li) {
  int lmem[1024];
  for (int i = 0; i < 1024; i++)
    lmem[i] = in[i]; // Never-stall pipelined LSU (store to lmem)

  out[gi] = lmem[li] ^ lmem[li + 1]; // Two never-stall pipelined LSUs (loads from lmem)
}