Intel® High Level Synthesis Accelerator Functional Unit Design Example User Guide

ID 683025
Date 7/19/2019
Public

4.2. Host Application Description

The host application uses the OPAE software API to communicate with the accelerator that runs inside your Intel PAC. OPAE is an Intel API that allows host applications to access the functionality of accelerators such as FPGA cards. You can learn more about the general usage of the OPAE software API in the Open Programmable Acceleration Engine (OPAE) C API Programming Guide.
Figure 22. Flow summary of a typical OPAE application

This host application is simplified compared with a production host application. For clarity, all the host code, including every OPAE API call, is in a single source file, hls_afu/sw/src/hls_afu_host.c. In a production design, you are more likely to make these API calls from libraries, as the DMA AFU design does. The headings in this section match the headings in hls_afu_host.c.

Table 2.  AFU Avalon MM Slave memory map

Slave Name                                                                             Address Range
Device feature header slave (afu_id_avmm_slave_0)                                      0x0000 to 0x003F
HLS component (fpVectorReduce_ac_int_internal_0 or fpVectorReduce_float_internal_0)    0x0040 to 0x007F

Preamble/Header Files

The first section of the host code includes the necessary header files and defines several constant address offsets. The design derives the CSR constants for the HLS component from the constants in fpVectorReduce_float_csr.h, which the HLS compiler emits. Because the HLS component's slave interface shares a memory space with the AFU ID MM slave, the host adds a base offset to the HLS component's register offsets so that each register in the AFU ID MM slave and the HLS component has a unique address.
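The address arithmetic described above can be sketched as follows. The base address 0x0040 and the 64-byte window come from Table 2; the macro names, the function name, and any register offset you pass in are hypothetical and are not taken from fpVectorReduce_float_csr.h.

```c
#include <stdint.h>

/* Start of the HLS component's window in the AFU memory map (Table 2). */
#define HLS_COMPONENT_BASE 0x0040u
/* Size of that window: 0x0040 to 0x007F is 64 bytes. */
#define HLS_COMPONENT_SPAN 0x0040u

/* Hypothetical helper: translate a component-relative CSR offset (the kind
 * of value a generated CSR header defines) into an AFU-level MMIO address.
 * Returns UINT64_MAX if the offset falls outside the component's window. */
uint64_t hls_csr_address(uint64_t csr_offset)
{
    if (csr_offset >= HLS_COMPONENT_SPAN) {
        return UINT64_MAX;
    }
    return HLS_COMPONENT_BASE + csr_offset;
}
```

With this scheme, a component register at relative offset 0x08 would be reached at AFU address 0x48, which cannot collide with the device feature header registers at 0x0000 to 0x003F.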

Discover/Grab FPGA Resources

This block of code is boilerplate. The design queries the FPGA hardware for available accelerators, and if the design finds the accelerator required by the host application, the host application attempts to control the FPGA device. In this design, the host also exercises the AF registers that the Acceleration Stack requires. The HLS component does not implement these registers, which are in the AFU device feature header Avalon-MM slave.

Setup and Populate Host-Side Memory

This block of code configures a contiguous host-side memory buffer that the AF can access. When you run an Acceleration Stack host, make sure that you configure it to use 2 MB hugepages using this command (you do not need this command if you are running using ASE):

$ sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"

This command allows the host application to allocate pinned, contiguous 2 MB buffers in its memory.

The fpgaPrepareBuffer() function allocates the host-side buffer that the design shares with the AF. This function allocates a block of memory starting at a user-specified address. Additionally, it guarantees that the memory block is aligned to 64 bytes (512 bits), which makes the AFU accesses efficient. fpgaGetIOAddress() returns the address that the AF can use to access the same memory space as the host. The host can populate the block of RAM as it does for any other array.
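The relationship between the host pointer and the AF-visible address can be illustrated with a small sketch. It does not call OPAE; the function names are hypothetical, and the sketch assumes that the shared buffer is one contiguous region, so that a given byte offset from the host base pointer corresponds to the same byte offset from the IO address.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* 512 bits */

/* Returns 1 if ptr sits on a 64-byte boundary, 0 otherwise. */
int is_cache_line_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % CACHE_LINE_BYTES) == 0;
}

/* Hypothetical helper: given the host base pointer of the shared buffer and
 * the IO address that the AF uses for that same buffer, compute the
 * AF-visible address of any host pointer inside the buffer. */
uint64_t host_ptr_to_io_address(const void *host_base, uint64_t io_base,
                                const void *host_ptr)
{
    const unsigned char *base = (const unsigned char *)host_base;
    const unsigned char *ptr = (const unsigned char *)host_ptr;
    return io_base + (uint64_t)(ptr - base);
}
```

For example, if the AF sees the buffer at IO address 0x10000, the fifth float in the buffer (byte offset 16) is at AF address 0x10010.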

Setup Interrupts

This design uses an interrupt framework to allow the AF to report to the host when it finishes processing. The HLS component generates the required interrupt in this design, so the host needs to write into the HLS component's slave memory space to enable the interrupt. Before writing, the host reads the CSR_INTERRUPT_ENABLE register to check whether the interrupt is already enabled. Refer to the Hello Interrupt AFU example included with Intel Acceleration Stack for more details about interrupts.
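The check-then-enable sequence can be sketched against a mock register file instead of real MMIO. Everything below other than the register name CSR_INTERRUPT_ENABLE is an assumption made for illustration: the offset, the bit position, and the mock_* helpers are hypothetical stand-ins for the generated CSR constants and the OPAE MMIO calls.

```c
#include <stdint.h>

/* Hypothetical offset and bit; the real values come from the CSR header
 * that the HLS compiler emits. */
#define CSR_INTERRUPT_ENABLE_OFFSET 0x08u
#define INTERRUPT_ENABLE_BIT        0x1u

/* Stand-in for the component's 64-byte slave space: eight 64-bit words. */
typedef struct {
    uint64_t regs[8];
} mock_csr_space;

uint64_t mock_read64(const mock_csr_space *space, uint64_t offset)
{
    return space->regs[offset / 8];
}

void mock_write64(mock_csr_space *space, uint64_t offset, uint64_t value)
{
    space->regs[offset / 8] = value;
}

/* Reads the enable register first and writes only if the interrupt is not
 * yet enabled. Returns 1 if a write was issued, 0 if none was needed. */
int enable_interrupt_if_needed(mock_csr_space *space)
{
    uint64_t current = mock_read64(space, CSR_INTERRUPT_ENABLE_OFFSET);
    if (current & INTERRUPT_ENABLE_BIT) {
        return 0;
    }
    mock_write64(space, CSR_INTERRUPT_ENABLE_OFFSET,
                 current | INTERRUPT_ENABLE_BIT);
    return 1;
}
```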

Start AF and Wait for Result

To start the AF, the host writes input variables into the HLS component's slave space (input data starting address, output data starting address, data size). Then it writes a 1 into the START bit in the HLS component's slave space. Using the poll API, the host waits for the AF to finish.
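The wait step can be sketched with the POSIX poll() call. This is a minimal illustration, not the code in hls_afu_host.c: the function name is hypothetical, and the sketch assumes that the host already holds a file descriptor that becomes readable when the AF raises its interrupt. Any readable descriptor, such as the read end of a pipe, can stand in for it during testing.

```c
#include <poll.h>

/* Waits until event_fd becomes readable or timeout_ms elapses.
 * Returns 1 if the descriptor is readable, 0 on timeout, -1 on error. */
int wait_for_completion(int event_fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = event_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) {
        return -1; /* poll() itself failed */
    }
    if (rc == 0) {
        return 0;  /* no event arrived before the timeout */
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Distinguishing the timeout case from the error case lets the host report a stalled AF separately from a failure of the wait itself.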

Check to Make Sure that the Calculation Was Correct

The host checks that the interrupt returned correctly (that is, the wait did not time out) and verifies that the output memory contains the expected values. This design also prints out some debug data at the end of the memory space, to illustrate that AFs can only perform 512-bit reads and writes. If you pass a vector whose length is not a multiple of 512 bits (64 bytes), the design overwrites some data in the output vector memory space.
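The size of the affected region can be estimated with a short calculation. The sketch below assumes that the output vector starts on a 64-byte boundary and that the AF pads its final write out to a full 64-byte line; the function name is hypothetical.

```c
#include <stddef.h>

#define ACCESS_BYTES 64u /* one 512-bit AF read or write */

/* Number of bytes written beyond the end of a vector of `count` elements of
 * `elem_size` bytes each, assuming whole 64-byte writes that start at a
 * 64-byte-aligned address. */
size_t bytes_written_past_end(size_t count, size_t elem_size)
{
    size_t payload = count * elem_size;
    size_t remainder = payload % ACCESS_BYTES;
    return remainder == 0 ? 0 : ACCESS_BYTES - remainder;
}
```

Under these assumptions, a vector of 16 four-byte floats fills exactly one 64-byte line and nothing extra is written, while a vector of 20 floats (80 bytes) needs two lines, so 48 bytes beyond the vector are overwritten.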

Cleanup

Finally, the host application disposes of the resources that it allocated during its execution.