Intel® High Level Synthesis Accelerator Functional Unit Design Example User Guide

ID 683025
Date 7/19/2019
Public

4.1.1.2. HLS AFU Avalon-MM Master Interfaces

An AFU can have two Avalon-MM master interfaces for accessing system memory. You must configure one Avalon-MM master as read-only; the other as write-only.

Meeting this requirement means modifying both the HLS component signature and the algorithm.

Because the smallest unit of data that an AFU can access in host memory is 64 bytes (512 bits), configure the Avalon-MM master interfaces by using HLS ihc::mm_master objects. For details about the parameters, refer to the Intel High Level Synthesis Compiler Reference Manual.

Then, modify the code found in the Simplified HLS Component to take advantage of the bandwidth afforded by the mandatory 512-bit data bus and access sixteen 32-bit values concurrently.

This access-size constraint means that if your vector length is not a multiple of 16 elements (equivalently, if its size is not a multiple of 64 bytes), the host memory locations between the end of your vector and the next multiple of 16 are filled with meaningless data.
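The rounding described above can be sketched in plain C++ as follows. This is a minimal illustration, assuming 32-bit floats; the helper names `wordCount`, `paddedLength`, and `paddingElements` are hypothetical and do not appear in the design example.

```cpp
#include <cstdint>

// Number of 32-bit values in one 512-bit (64-byte) host-memory word.
constexpr std::uint64_t kFloatsPerWord = 16;

// Hypothetical helper: number of 512-bit words needed to hold `size` floats.
constexpr std::uint64_t wordCount(std::uint64_t size) {
  return (size + kFloatsPerWord - 1) / kFloatsPerWord;
}

// Hypothetical helper: element count after rounding up to a whole word.
constexpr std::uint64_t paddedLength(std::uint64_t size) {
  return wordCount(size) * kFloatsPerWord;
}

// Hypothetical helper: elements in the last word that carry meaningless data.
constexpr std::uint64_t paddingElements(std::uint64_t size) {
  return paddedLength(size) - size;
}
```

For example, a 100-element vector occupies 7 words, so the last 12 elements of the final word do not hold vector data. A host program could allocate `paddedLength(size)` elements so that the whole final word belongs to the buffer.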

For best performance, do not attempt read-modify-write operations. Because the design has two separate Avalon-MM master interfaces, the Intel® HLS Compiler assumes separate address spaces, and assumes no dependencies exist between the two Avalon-MM master interfaces.

The HLS AFU design example demonstrates how to connect with the 512-bit Avalon-MM master interfaces. You can take either of the following approaches:
  • Let the Intel® HLS Compiler abstract away that detail and assume the accesses are floats.
  • Assume host-memory accesses are 512-bit unsigned integers.

float Accesses

You can define the ihc::mm_master using float as the underlying data type while keeping the data bus 512 bits wide.
Figure 16. Signature for float-based component
typedef ihc::mm_master<float, ihc::dwidth<512>, ihc::awidth<48>, 
                       ihc::latency<0>, ihc::aspace<1>, 
                       ihc::readwrite_mode<readonly>, ihc::waitrequest<true>,
                       ihc::align<64>, ihc::maxburst<4>> MasterReadFloat;
typedef ihc::mm_master<float, ihc::dwidth<512>, ihc::awidth<48>, 
                       ihc::latency<0>, ihc::aspace<2>, 
                       ihc::readwrite_mode<writeonly>,
                       ihc::waitrequest<true>, ihc::align<64>,
                       ihc::maxburst<4>> MasterWriteFloat;

component
hls_avalon_slave_component 
float fpVectorReduce_float(hls_avalon_slave_register_argument MasterReadFloat &masterRead,
                           hls_avalon_slave_register_argument MasterWriteFloat &masterWrite,
                           hls_avalon_slave_register_argument uint64_t size);

Use the following parameter settings to enable an HLS component to communicate with host memory from the AFU:

  • 48-bit wide address (awidth parameter)
  • Variable latency with a wait-request signal, as DRAM requires (latency and waitrequest parameters)
  • 64 concurrent bytes can be read at once (align parameter)
  • Maximum burst size of four 512-bit reads (maxburst parameter)
  • Separate physical Avalon-MM master ports (aspace parameter)
  • One read-only and one write-only Avalon-MM master (readwrite_mode parameter)
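The relationship between these settings can be checked with a few compile-time constants. The sketch below is plain C++ rather than HLS code; the constant names are hypothetical, and it assumes a 4-byte float and that a burst is counted in 512-bit words.

```cpp
#include <cstddef>

// Values copied from the mm_master typedefs in Figure 16.
constexpr std::size_t kDataWidthBits = 512;  // ihc::dwidth<512>
constexpr std::size_t kAlignBytes    = 64;   // ihc::align<64>
constexpr std::size_t kMaxBurst      = 4;    // ihc::maxburst<4>

// Derived quantities (hypothetical names, for illustration only).
constexpr std::size_t kBytesPerWord  = kDataWidthBits / 8;
constexpr std::size_t kLanesPerWord  = kBytesPerWord / sizeof(float);
constexpr std::size_t kMaxBurstBytes = kMaxBurst * kBytesPerWord;

// The alignment matches the width of one data word.
static_assert(kBytesPerWord == kAlignBytes, "align should equal dwidth / 8");
// Sixteen 32-bit lanes fit in one word, matching the unroll factor of 16.
static_assert(kLanesPerWord == 16, "expected sixteen floats per 512-bit word");
```

Under these assumptions, a full burst moves 256 bytes, which is 64 floats.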
Figure 17.  float-based body
#pragma unroll 16
for (int itr = 0; itr < 16; itr++)
{
  int idx = itr + (loop_idx * 16);
  if (idx < size)
  {
    float readVal = masterRead[idx];
    readSum += readVal;
    masterWrite[idx] = readVal + 1.0f;
  }
}
sum += readSum;

Access the mm_master inside the unrolled loop body 32 bits at a time. To verify that the compiler infers 512-bit accesses, check the Component Viewer section of the generated HLS report.html report (Figure 18) for 512-bit burst-coalesced LSUs, and make sure that they are aligned. To confirm that your loads occur 512 bits at a time, inspect the simulation waveforms in Figure 19.

Figure 18. HLS Report showing float-based component. Observe the coalesced Avalon-MM master interfaces.
Figure 19. ModelSim waveform showing host memory accesses of float-based component
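When checking component output, a software reference model of the float-based body can be useful. The following plain C++ sketch is not part of the design example: the function name `referenceVectorReduce` is hypothetical, and the outer loop over `loop_idx` is an assumption, because Figure 17 shows only the inner unrolled loop.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model of the float-based body.
// Returns the sum of `input` and fills `output` with input[i] + 1.0f.
float referenceVectorReduce(const std::vector<float>& input,
                            std::vector<float>& output) {
  const std::uint64_t size = input.size();
  output.assign(size, 0.0f);
  float sum = 0.0f;

  // Assumed outer loop: one iteration per 512-bit word (16 floats).
  for (std::uint64_t loop_idx = 0; loop_idx * 16 < size; ++loop_idx) {
    float readSum = 0.0f;
    // Same structure as Figure 17, executed sequentially here.
    for (int itr = 0; itr < 16; ++itr) {
      const std::uint64_t idx = itr + loop_idx * 16;
      if (idx < size) {
        const float readVal = input[idx];
        readSum += readVal;
        output[idx] = readVal + 1.0f;
      }
    }
    sum += readSum;
  }
  return sum;
}
```

Because floating-point addition is not associative, hardware that sums the 16 lanes in a different order may differ from this model in the last bits, so comparing against it with a small tolerance is likely safer than testing for exact equality.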

ac_int Accesses

You can define the ihc::mm_master using an unsigned 512-bit ac_int as the underlying data type.

This signature is the same as the one in Figure 16, except that the mm_master type is ac_int instead of float.

Figure 20. Signature for ac_int-based component
typedef ac_int<512, false> uint512; // 512-bit unsigned integer
typedef ihc::mm_master<uint512, ihc::dwidth<512>, 
                       ihc::awidth<48>, ihc::latency<0>,
                       ihc::aspace<1>, ihc::readwrite_mode<readonly>,  
                       ihc::waitrequest<true>, ihc::align<64>, 
                       ihc::maxburst<4> > MasterReadAcInt;

typedef ihc::mm_master<uint512, ihc::dwidth<512>, 
                       ihc::awidth<48>, ihc::latency<0>, 
                       ihc::aspace<2>, ihc::readwrite_mode<writeonly>, 
                       ihc::waitrequest<true>, ihc::align<64>, 
                       ihc::maxburst<4> > MasterWriteAcInt;
	
component
hls_avalon_slave_component
float fpVectorReduce_ac_int(
   hls_avalon_slave_register_argument MasterReadAcInt &masterRead,
   hls_avalon_slave_register_argument MasterWriteAcInt &masterWrite,
   hls_avalon_slave_register_argument uint64_t size);

This method is more verbose, but it guarantees that all Avalon-MM master accesses coalesce to 512 bits wide. You can access the 32-bit parts of the 512-bit read result using the slc and set_slc functions provided by ac_int (for more information on these functions, refer to the ac_int Reference Manual, Mentor Graphics Corporation). This component explicitly performs one 512-bit read and one 512-bit write (line 1 and line 36 of Figure 21).

Figure 21. ac_int-based body. As found in hls_afu.cpp
uint512 readVal = masterRead[loop_idx];
uint512 writeVal = 0;

#pragma unroll 16 // do each loop iteration concurrently.
                   // Use 16 iterations because there are 16
                   // 32-bit slices in each 512-bit word.
for (int itr = 0; itr < 16; itr++)
{
  int idx = itr + (loop_idx * 16);
  if (idx < size)
  {
    // grab a 32-bit piece of the 512-bit value that we read
    uint32 readVal_32 = readVal.slc<32>(itr * 32); 
  
    // use explicit type casting to process the bits pointed 
    // to by &readVal_32 as a float.
    void *readVal_32_ptr = &readVal_32;
    float readVal_f;
    float *readVal_f_ptr = &readVal_f;
    *readVal_f_ptr = *((float *) readVal_32_ptr);
    readSum += readVal_f;
  
    // increment and output
    float writeVal_f = readVal_f + 1.0f;
  
    // use explicit type casting to process the bits pointed 
    // to by &writeVal_f as a uint32.
    float *writeVal_f_ptr = &writeVal_f;
    uint32 *writeVal_32_ptr = (uint32 *) writeVal_f_ptr;
    uint32 writeVal32 = *writeVal_32_ptr;
  
    unsigned int bit_offset = itr * 32;
    writeVal.set_slc(bit_offset, writeVal32);
  }
}
masterWrite[loop_idx] = writeVal;
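The lane handling in Figure 21 can be tried out in plain C++ without the HLS headers. In the sketch below, `Word512` is a hypothetical stand-in for the 512-bit ac_int, with array element `itr` playing the role of the slice at bit offset `itr * 32`. The bit reinterpretation uses `std::memcpy` instead of pointer casts, which keeps the example well-defined in standard C++; whether the Intel HLS Compiler treats the two forms identically is not covered here. All function names are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ac_int<512, false>: sixteen 32-bit lanes.
using Word512 = std::array<std::uint32_t, 16>;

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Reinterpret 32 raw bits as a float without changing them.
inline float bitsToFloat(std::uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Reinterpret a float as its 32 raw bits without changing them.
inline std::uint32_t floatToBits(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Process one word: returns the sum of the valid lanes and writes
// lane + 1.0f into the matching lane of `writeVal`. Lanes at or beyond
// `size` stay zero, as in Figure 21.
inline float processWord(const Word512& readVal, Word512& writeVal,
                         std::uint64_t loop_idx, std::uint64_t size) {
  float readSum = 0.0f;
  writeVal.fill(0);
  for (int itr = 0; itr < 16; ++itr) {
    const std::uint64_t idx = itr + loop_idx * 16;
    if (idx < size) {
      const float readVal_f = bitsToFloat(readVal[itr]);  // like slc<32>
      readSum += readVal_f;
      writeVal[itr] = floatToBits(readVal_f + 1.0f);      // like set_slc
    }
  }
  return readSum;
}
```

Calling `processWord` once per 512-bit word and accumulating the returned partial sums mirrors the structure of the component body, under the assumption that an outer loop advances `loop_idx`.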