5.4. Example: Specifying Bank-Selection Bits for Local Memory Addresses
The (b 0 , b 1 , ... ,b n ) arguments refer to the local memory address bit positions that the Intel® HLS Compiler should use for the bank-selection bits. Specifying the hls_bankbits(b 0, b 1, ..., b n) attribute implies that the number of banks equals 2 number of bank bits .
Bank 0 | Bank 1 | Bank 2 | Bank 3 | |
Word 0 | 00 000 | 01 000 | 10 000 | 11 000 |
Word 1 | 00 001 | 01 001 | 10 001 | 11 001 |
Word 2 | 00 010 | 01 010 | 10 010 | 11 010 |
Word 3 | 00 011 | 01 011 | 10 011 | 11 011 |
Word 4 | 00 100 | 01 100 | 10 100 | 11 100 |
Word 5 | 00 101 | 01 101 | 10 101 | 11 101 |
Word 6 | 00 110 | 01 110 | 10 110 | 11 110 |
Word 7 | 00 111 | 01 111 | 10 111 | 11 111 |
Example of Implementing the hls_bankbits Attribute
Consider the following example component code:
component int bank_arb_consecutive_multidim (int raddr,
int waddr,
int wdata,
int upperdim) {
int a[2][4][128] hls_numbanks(1);
#pragma unroll
for (int i = 0; i < 4; i++) {
a[upperdim][i][(waddr & 0x7f)] = wdata + i;
}
int rdata = 0;
#pragma unroll
for (int i = 0; i < 4; i++) {
rdata += a[upperdim][i][(raddr & 0x7f)];
}
return rdata;
}
As illustrated in the following figure, this code example generates multiple load and store instructions, and therefore multiple load/store units (LSUs) in the hardware. If the memory system is not split into multiple banks, there are fewer ports than memory access instructions, leading to arbitrated accesses. This arbitration results in a high loop initiation interval (II) value. Avoid arbitration blocks whenever possible because they consume a lot of FPGA area and can hurt the performance of your component.
By default, the Intel® HLS Compiler splits the memory into banks if it determines that the split is beneficial to the performance of your component. When the compiler generates a memory system, it uses the lower-order memory address bits to access the different memory banks. This behavior means that if you define your component memory structure so that the lowest order addresses are accessed in parallel, the compiler automatically infers the bank-selection bits for you.
This access pattern prevents stallable arbitration on the memory. In this case, preventing stallable arbitration reduced the II value to 1. In practice, this might mean that you store a matrix in column-major format instead of row-major format, if you intend to access multiple matrix rows concurrently.
Swapping the 128-element and 4-element dimension in the code example that follows results in no stallable memory arbitration.
|
The dimension that is accessed in parallel is moved to be the lowest-order dimension in the memory array. The load has a width of 128 bits, which is the same as four 32-bit loads.
If you cannot change your memory structure, you can use the hls_bankbits attribute to explicitly control how load and store instructions access local memory. As shown in the following modified code example and figure, when you choose constant bank-select bits for each access to the local memory a, each pair of load and store instructions needs to connect to only one memory bank. In this example, there are four 32-bit loads, which results in a memory system similar to the earlier example.
|
When specifying the word-address bits for the hls_bankbits attribute, ensure that the resulting bank-select bits are constant for each access to local memory. As shown in the following example, the local memory access pattern does not guarantee that the chosen bank-select bits are constant for each access. As a result, each pair of load and store instructions must connect to all the local memory banks, leading to stallable accesses.
|
In this case, the II is estimated to be approximately 64.