9.2. Parallelizing Inference Using FPGA AI Suite with Multiple Lanes and Multiple Instances
After applying folding strategies for resource balancing, the next stage of design refinement is performance optimization through parallelization. At this stage, the objective shifts from fitting the model into the device to maximizing throughput and minimizing latency.
Architectural Vector Scaling
The primary mechanism for compute parallelism within the FPGA AI Suite IP is adjustment of c_vector and k_vector.
- c_vector defines the dot-product width per PE. Supported values are 4, 8, 16, 32, and 64.
- k_vector specifies the number of filters the PE array can process in parallel. Legal values range from 4 to 128, and the value must be a multiple of c_vector.
Increasing these parameters reduces folding in the compiler output and increases the amount of work executed per cycle. This technique improves MAC throughput until constrained by available DSPs, BRAM, or achievable fMAX.
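As a first-order illustration of how these two parameters multiply, the sketch below estimates peak MAC throughput as c_vector × k_vector multiply-accumulates per clock cycle. The helper function, the one-MAC-per-dot-product-element-per-cycle model, and the 400 MHz fMAX in the example are illustrative assumptions; achieved throughput also depends on folding, layer shapes, and memory behavior.

    # Illustrative first-order model of PE-array parallelism (not a tool API).
    def peak_macs_per_second(c_vector: int, k_vector: int, fmax_mhz: float) -> float:
        """Estimate peak MAC rate: k_vector filters, each fed a c_vector-wide dot product per cycle."""
        assert k_vector % c_vector == 0, "k_vector must be a multiple of c_vector"
        macs_per_cycle = c_vector * k_vector
        return macs_per_cycle * fmax_mhz * 1e6

    # Widening from (c_vector=8, k_vector=32) to (c_vector=16, k_vector=64)
    # quadruples the peak rate at the same assumed 400 MHz fMAX.
    print(peak_macs_per_second(8, 32, 400.0))    # ~1.0e11 MAC/s
    print(peak_macs_per_second(16, 64, 400.0))   # ~4.1e11 MAC/s

In practice the wider configuration pays off only if the additional DSP and BRAM usage still closes timing at the target fMAX.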
Streaming Optimization and Multilane Scaling
As architectural width increases, external DDR bandwidth becomes the limiting factor. Transitioning to DDR-free streaming removes this bottleneck by routing feature maps directly into and out of the accelerator pipeline.
- The num_lanes parameter scales the PE array by duplicating data lanes. Legal values are 1, 2, and 4.
- The total stream buffer size scales proportionally with the number of lanes, so the per-lane stream_buffer_depth should be scaled by the inverse of num_lanes to keep the overall on-chip buffer footprint constant (see the helper sketch at the end of this subsection).
- To achieve best performance with DDR-free inference mode, enable multilane operation.
This optimization reduces per-frame latency and increases effective throughput by aligning data movement with compute capacity.
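The inverse relationship between num_lanes and the per-lane stream_buffer_depth can be expressed as a small helper, sketched below. The baseline depth and the hard-coded lane check are illustrative assumptions rather than values taken from a shipped architecture description.

    # Illustrative helper: hold the total stream-buffer footprint roughly constant
    # while scaling the number of lanes (legal num_lanes values are 1, 2, and 4).
    def scaled_stream_buffer_depth(base_depth: int, num_lanes: int) -> int:
        """Return a per-lane depth so that depth * num_lanes stays near base_depth."""
        if num_lanes not in (1, 2, 4):
            raise ValueError("num_lanes must be 1, 2, or 4")
        return max(1, base_depth // num_lanes)

    # Example: a single-lane depth of 16384 becomes 8192 with 2 lanes and 4096 with 4 lanes.
    for lanes in (1, 2, 4):
        print(lanes, scaled_stream_buffer_depth(16384, lanes))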
Graph-Level Optimization
Graph-level optimization enhances runtime efficiency by restructuring execution flow and maximizing hardware utilization without modifying the accelerator IP itself. These optimizations target scheduling, memory access, and computational overlap.
- Batching allows multiple inputs to be processed as a grouped workload, amortizing setup and initialization overhead across multiple samples. This approach increases throughput for inference scenarios where latency per frame is less critical.
- Job Queuing keeps the accelerator busy by overlapping host-to-device transfers, compute execution, and device-to-host transfers, minimizing idle cycles within the pipeline (see the asynchronous queue sketch at the end of this subsection).
- Operator Fusion combines adjacent layers (e.g., convolution, bias, activation) into a single execution kernel, reducing intermediate memory writes and reads. This decreases bandwidth demand and improves latency.
- Graph Partitioning splits the model across multiple compute partitions, enabling sections of the graph to execute concurrently on independent accelerator instances or host-device pipelines. Partitioning aligns with multi-instance deployment strategies.
- Weight Preloading ensures that frequently reused weights are cached in on-chip memory, reducing DDR bandwidth usage when the same filters are applied repeatedly across multiple inputs.
- Asynchronous Execution leverages runtime APIs to queue multiple graphs or subgraphs, overlapping workloads across FPGA compute engines, thereby achieving concurrency at the application layer.
- Dynamic Scheduling enables runtime-level decisions on task ordering, prioritizing low-latency jobs or throughput-optimized batches based on workload characteristics.
These tactics extend beyond folding and vector scaling by addressing bottlenecks at the graph compilation and scheduling level. When combined with architectural parallelism and streaming, graph-level optimizations create a balanced pipeline where compute, data movement, and scheduling are jointly optimized.
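The job-queuing, batching, and asynchronous-execution tactics above are typically exercised from the host through the OpenVINO runtime that FPGA AI Suite integrates with. The sketch below shows one way to keep the accelerator saturated with in-flight requests; the model path, device string, input shape, and queue depth are placeholder assumptions, not values from this design.

    # Hedged host-side sketch: overlapping transfers and compute with a queue of
    # asynchronous inference jobs (OpenVINO runtime API). Paths, device string,
    # and shapes below are placeholders.
    import numpy as np
    from openvino.runtime import Core, AsyncInferQueue

    core = Core()
    model = core.read_model("model.xml")                     # placeholder model
    compiled = core.compile_model(model, "HETERO:FPGA,CPU")  # placeholder device string

    results = {}

    def on_done(request, frame_id):
        # Runs when the device finishes a job; copy out the first output tensor.
        results[frame_id] = request.get_output_tensor(0).data.copy()

    queue = AsyncInferQueue(compiled, jobs=4)  # several in-flight jobs hide transfer latency
    queue.set_callback(on_done)

    frames = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(16)]
    for i, frame in enumerate(frames):
        queue.start_async({0: frame}, userdata=i)  # returns immediately; runtime schedules the job
    queue.wait_all()
    print(f"{len(results)} frames processed")

Batching follows the same pattern: each request carries a batch of N inputs instead of a single frame, trading per-frame latency for higher device utilization.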
Multi-Instance Deployment
Scaling beyond a single FPGA AI Suite IP core is achieved by deploying multiple accelerator instances within the same FPGA fabric. This approach enables concurrent execution of independent inference workloads, increasing aggregate throughput (a host-side sketch follows at the end of this subsection).
- Instance Configuration: Each FPGA AI Suite IP instance operates independently, with its own control and status registers, DMA interfaces, and memory resources. This isolation ensures that each instance can process data independently without interference.
- Resource Allocation: Proper allocation of FPGA resources is crucial. Each instance requires a portion of the FPGA DSPs, BRAMs, and logic elements. The total resource usage scales with the number of instances deployed.
- Interconnect Design: The FPGA interconnect fabric must be designed to support multiple instances, ensuring that data can be routed efficiently between the host and each FPGA AI Suite IP instance.
- Clock Management: Each instance may operate on a separate clock domain, requiring careful management of clocking resources and timing constraints to ensure stable operation.
- Power Considerations: Deploying multiple instances increases the overall power consumption of the FPGA. Power budgeting must account for the cumulative power requirements of all active instances.
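A minimal host-side sketch of multi-instance operation is shown below. It assumes the runtime exposes each FPGA AI Suite IP instance as a separately addressable device; the device strings, model path, and input shape are hypothetical placeholders.

    # Hedged sketch: one host thread per accelerator instance so both instances
    # process independent workloads concurrently. Device names are hypothetical.
    import threading
    import numpy as np
    from openvino.runtime import Core

    def run_instance(device: str, num_frames: int) -> None:
        core = Core()
        compiled = core.compile_model(core.read_model("model.xml"), device)
        request = compiled.create_infer_request()
        for _ in range(num_frames):
            frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
            request.infer({0: frame})  # synchronous inference on this instance only

    threads = [
        threading.Thread(target=run_instance, args=(dev, 8))
        for dev in ("HETERO:FPGA.0,CPU", "HETERO:FPGA.1,CPU")  # hypothetical instance names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()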
Composite Optimization Strategy
An optimized FPGA AI Suite design applies parallelization across multiple layers:
- Architectural vector scaling (c_vector, k_vector) for compute parallelism.
- Streaming with multilane (num_lanes) for bandwidth and latency optimization.
- Batching and job queues for graph-level utilization.
- Multi-instance replication for workload concurrency.
Parallelization in FPGA AI Suite IP is achieved through progressive application of these methods, each addressing specific bottlenecks in the compute, memory, or workload domain.
Next Optimization Step
With parallelization optimizations complete, the next stage in optimizing your IP is to transform the input data layout to further align model execution with hardware efficiency.