2.4.1.1. How Does the FPGA AI Suite Overlay Architecture Work?
In a typical deep learning application such as object detection on urban streets, the process begins with selecting a pre-trained deep learning model designed for real-time inference. These models are structured as computational graphs: directed networks of operations such as convolution, matrix multiplication, normalization, and activation functions. At an abstract level, the model functions as a mapping from input space to output space, denoted as a function F(x). In this context, the input x represents image or video data, and the output F(x) consists of bounding boxes indicating object locations along with associated confidence scores.
Deep learning models used for tasks such as object detection are typically composed of multiple layers, each representing a mathematical function with associated parameters. Trainable parameters—such as weights and biases—are adjusted during the training phase to improve model accuracy, while hyperparameters—such as convolutional filter size—remain fixed. These layers commonly include convolutional operations for feature extraction, spatial reduction mechanisms like pooling or striding, normalization layers for training stability, activation functions to introduce non-linearity, and output layers for generating predictions such as bounding boxes and class probabilities. When structured effectively, these components enable the model to perform complex, high-level tasks in a single inference pass.
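To make this layered view of F(x) concrete, the following is a minimal, self-contained sketch of such a layer stack in NumPy. This is not FPGA AI Suite code: the shapes and weights are hypothetical, and a real detector composes many more trained layers, but the structure (convolution, activation, normalization, pooling, output head) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Naive valid convolution: feature extraction (trainable weights w)."""
    co, ci, kh, kw = w.shape
    h, wd = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((co, h, wd))
    for o in range(co):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(x[:, i:i+kh, j:j+kw] * w[o])
    return out

def relu(x):                       # activation: introduces non-linearity
    return np.maximum(x, 0.0)

def maxpool2(x):                   # spatial reduction (pooling)
    c, h, w = x.shape
    return x[:, :h//2*2, :w//2*2].reshape(c, h//2, 2, w//2, 2).max(axis=(2, 4))

def batchnorm(x, eps=1e-5):        # normalization for stability
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Hypothetical trained parameters (weights) and a fixed hyperparameter
# (the 3x3 filter size). In practice these come from the training phase.
w_conv = rng.standard_normal((8, 3, 3, 3))
w_head = rng.standard_normal((8 * 15 * 15, 5))  # 5 = 4 box coords + 1 score

def F(x):
    """The whole model as a single mapping from input space to output space."""
    h = maxpool2(batchnorm(relu(conv2d(x, w_conv))))
    return h.reshape(-1) @ w_head   # output head: box coordinates + confidence

frame = rng.standard_normal((3, 32, 32))  # a toy "image" x
print(F(frame))                            # F(x): one box plus a confidence score
```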
Models obtained from public repositories are generally trained using modern machine learning frameworks. While implementations vary across frameworks, the underlying mathematical principles remain consistent. To facilitate deployment across heterogeneous hardware platforms—including CPUs, GPUs, and FPGAs—the OpenVINO™ toolkit converts these models into a unified Intermediate Representation (IR) using its Model Converter, enabling efficient and portable execution.
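For example, assuming a model exported to ONNX (the file name here is a placeholder) and OpenVINO™ 2023.0 or later, the conversion can be performed with the Python API:

```python
import openvino as ov

# Convert a framework model (here, a hypothetical ONNX export) into the
# framework-neutral OpenVINO IR, held in memory.
ov_model = ov.convert_model("detector.onnx")

# Serialize the IR to disk; this writes detector.xml (topology) and
# detector.bin (weights) side by side.
ov.save_model(ov_model, "detector.xml")
```

The ovc command-line tool provides the same conversion for workflows that do not use the Python API.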
The IR comprises two critical files:
- an XML file describing the model’s network topology, in which OpenVINO™ layer operations, together with their hyperparameters, are represented as nodes, and data flows are represented as edges.
- a BIN file containing the trained model’s weights and biases.
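The two files are consumed together. As a sketch, assuming the file names from the conversion step above, the OpenVINO™ Runtime pairs them when loading a model and exposes the topology as a graph of nodes:

```python
import openvino as ov

core = ov.Core()
# read_model() takes the XML topology; the matching .bin file with the
# weights is located automatically alongside it (or passed explicitly).
model = core.read_model("detector.xml")
for op in model.get_ordered_ops():   # the nodes of the topology graph
    print(op.get_type_name(), op.get_friendly_name())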
The IR provides a standardized, optimized model format for efficient execution across heterogeneous hardware platforms, including CPUs and FPGAs. Within the FPGA AI Suite toolchain, the IR serves as the input to the front-end compiler, which therefore never needs to handle models from third-party frameworks directly. The layers and hyperparameter ranges supported by the FPGA AI Suite Overlay IP are documented in the corresponding reference tables.
The FPGA AI Suite Compiler processes IR files to generate a Runtime Configuration Binary. This binary encodes the control logic required by the Overlay IP to manage data movement, coordinate execution, and perform inference efficiently.
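As a sketch of this step under stated assumptions: the compiler is invoked as a command-line tool, dla_compiler, shown here through Python's subprocess module. The flag names are assumptions drawn from one FPGA AI Suite release and should be verified against dla_compiler --help for the installed version:

```python
import subprocess

# Sketch only: flag names are assumptions based on one FPGA AI Suite release.
subprocess.run(
    [
        "dla_compiler",
        "--march", "A10_Performance.arch",   # Architecture Description File
        "--network-file", "detector.xml",    # OpenVINO IR topology (+ .bin weights)
        "--foutput-format=open_vino_hetero", # output format (assumed name)
        "--o", "detector_compiled.bin",      # Runtime Configuration Binary
    ],
    check=True,
)
```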
Because model architectures and layer complexity vary widely, a single instantiation of the Overlay IP may not be optimal for every model. Customization is enabled through an Architecture Description File (.arch), a human-readable configuration file that defines the architectural parameters of the overlay. This allows the overlay to be tailored to a specific model, or shared across multiple models, without re-running Quartus synthesis.
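For illustration only, an .arch file is a short text file of named architecture parameters. The fragment below is hypothetical; the actual parameter names and legal values are defined by each FPGA AI Suite release, which ships example .arch files:

```
# Hypothetical .arch fragment -- field names are illustrative, not normative.
family: "AGX7"       # target device family (assumed name)
k_vector: 16         # output-channel vectorization of the PE array (assumed)
c_vector: 16         # input-channel vectorization (assumed)
```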
The Overlay IP supports multiple system topologies, including configurations with high-performance host CPUs, embedded processors, or standalone operation. It also accommodates various data access patterns, such as direct streaming or external DDR memory access. For example, in edge vision applications, video frames can be streamed directly into the overlay, with inference results output in real time and intermediate data retained on-chip to minimize memory bandwidth usage.
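A minimal host-CPU inference sketch follows, assuming the IR from the earlier steps and the HETERO:FPGA,CPU device string used in FPGA AI Suite runtime examples (verify the device name for your release). With this configuration, layers the overlay supports run on the FPGA and the remainder fall back to the CPU:

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("detector.xml")

# Device string as used in FPGA AI Suite runtime examples (an assumption here):
# overlay-supported layers run on the FPGA, the rest fall back to the CPU.
compiled = core.compile_model(model, "HETERO:FPGA,CPU")

frame = np.zeros((1, 3, 416, 416), dtype=np.float32)  # placeholder video frame
results = compiled(frame)                   # returns a dict of output tensors
boxes_and_scores = results[compiled.output(0)]
```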