Intel® High Level Synthesis Compiler Pro Edition: Reference Manual
5.4.3.2. Memory-Access Coalescing and Load-Store Units
The Intel® HLS Compiler might coalesce multiple memory accesses into a wider memory access to save on the number of LSUs instantiated.
When the compiler coalesces the memory accesses, it is referred to as static coalescing because the coalescing occurs at compile time. This static coalescing contrasts with the dynamic coalescing done by a burst-coalesced LSU.
The compiler typically attempts static coalescing when it detects multiple memory operations that access consecutive locations in memory. This coalescing is usually beneficial because it reduces the number of LSUs that compete for a shared memory interface.
The compiler coalesces memory accesses only up to the width of the memory interface that is being accessed. For an external memory interface, the maximum width is predetermined by the properties of the external memory that you are accessing. For a component (internal) memory interface, the maximum width can be set by the compiler based on the memory geometry that the compiler creates. For more details about component memories, see Component Memories (Memory Attributes).
For the following code example, the Intel® HLS Compiler statically coalesces the four 4-byte-wide load operations into one 16-byte-wide load operation. A similar coalescing is done for the four store operations. Coalescing the load and store operations reduces the number of required accesses to the Avalon MM Host interfaces by 4.
#include "HLS/hls.h"
component void
static_coalescing(ihc::mm_host<int, ihc::dwidth<128>, ihc::awidth<32>,
                               ihc::aspace<1>, ihc::latency<0>> &in,
                  ihc::mm_host<int, ihc::dwidth<128>, ihc::awidth<32>,
                               ihc::aspace<2>, ihc::latency<0>> &out,
                  int i) {
  // Four loads statically coalesced into one 16 bytes wide load
  int a1 = in[3 * i + 0];
  int a2 = in[3 * i + 1];
  int a3 = in[3 * i + 2];
  int a4 = in[3 * i + 3];
  // Four stores statically coalesced into one 16 bytes wide store
  out[3 * i + 0] = a4;
  out[3 * i + 1] = a3;
  out[3 * i + 2] = a2;
  out[3 * i + 3] = a1;
}
 
   
     component int alignedRead(AHost &A, unsigned int i)
{
    hls_register int memory_line[PARALLEL_WORDS] = {};
    int word_base = PARALLEL_WORDS * (i / PARALLEL_WORDS);
#pragma unroll
    for (int j = 0; j < PARALLEL_WORDS; j++)
    {
        int idx_base = PARALLEL_WORDS * ((i + j) / PARALLEL_WORDS);
        bool doesntWrap = idx_base == word_base;
        if (doesntWrap)
        {
            memory_line[j] = A[i + j];
        }
    }
    ...
}
 
  The compiler infers a nonaligned LSU because the compiler cannot determine that all values of j are read from the same word of memory A. Although the range of j matches exactly one word of memory, the accesses might wrap to another word. More than 1 read might be required. If you code your access with a base that aligns with the word width, the compiler can guarantee no wrapping occurs and so the compiler gives you an aligned LSU.