2.4.1.2. Parallelism in the FPGA AI Suite IP
This section outlines a step-by-step optimization process designed to convey a high-level understanding of computation parallelization within the FPGA AI Suite IP. The discussion includes the parameterization of the IP within the Architecture Description File (.arch file) and the corresponding implications for computation.
Consider a convolutional layer within an ML graph. The input to this convolution is referred to as the input feature, the convolution kernel as a filter, and the output of the convolution as the output feature. In a naïve convolution approach, as the filter traverses the input feature, a single product is computed at each position and accumulated sequentially.
Once the entire input feature has been processed for a given filter, a single result within the output feature is produced. This method, computing only one product at a time, underutilizes available FPGA hardware resources.
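The naïve approach can be pictured with the following sketch. This Python snippet is only a conceptual illustration of the sequential multiply-accumulate pattern, not the IP implementation; the tensor layout and function names are assumptions made for the example.

```python
# Conceptual sketch (not the IP implementation): a naive 2D convolution in
# which one product is computed and accumulated per step.
import numpy as np

def naive_conv2d(x, w):
    """x: input feature (C, H, W); w: one filter (C, KH, KW) -> one output channel."""
    C, H, W = x.shape
    _, KH, KW = w.shape
    out = np.zeros((H - KH + 1, W - KW + 1), dtype=np.float32)
    for oh in range(out.shape[0]):
        for ow in range(out.shape[1]):
            acc = 0.0
            for c in range(C):              # one multiply-accumulate at a time
                for kh in range(KH):
                    for kw in range(KW):
                        acc += x[c, oh + kh, ow + kw] * w[c, kh, kw]
            out[oh, ow] = acc
    return out
```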
The first optimization step increases parallel computation capacity by unrolling computations along the input channel dimension. By duplicating the dot-product computation c_vector (sometimes referred to as CVEC) times, multiple products can be calculated simultaneously. Each duplicated dot-product engine is termed a processing element (PE). Exploiting the FPGA’s spatial hardware architecture in this manner significantly enhances performance at the cost of additional DSP usage. Additional details can be found in Parameter: c_vector.
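Conceptually, unrolling along the input channel dimension replaces the innermost channel loop with a dot product over c_vector elements per step. The sketch below is illustrative only; the C_VECTOR value and helper names are assumptions for the example, not recommended .arch settings.

```python
# Conceptual sketch of unrolling along the input channel dimension: instead of
# one product per step, the PE computes c_vector products at once.
import numpy as np

C_VECTOR = 16  # illustrative value

def pe_dot(x_block, w_block):
    """One PE step: c_vector multiplies performed in parallel, then summed."""
    return np.dot(x_block, w_block)

def conv_point_unrolled(x, w, oh, ow):
    """Compute one output value, consuming c_vector channels per PE step."""
    C = x.shape[0]
    _, KH, KW = w.shape
    acc = 0.0
    for kh in range(KH):
        for kw in range(KW):
            for c0 in range(0, C, C_VECTOR):
                xb = x[c0:c0 + C_VECTOR, oh + kh, ow + kw]
                wb = w[c0:c0 + C_VECTOR, kh, kw]
                acc += pe_dot(xb, wb)
    return acc
```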
For computer vision tasks, the input tensors to the ML graph are usually 8-bit pixels, or are organized in a layout that matches the mathematical description of the convolution layer; both differ from the layout in which the PE consumes them. Because the PE computes a dot product over c_vector FP16 feature elements at a time, the layout transform is responsible for converting the input features in memory into the shape and precision that the PE expects, so that the PE can process them efficiently. The layout transform can be thought of as an optimized address lookup and datatype conversion for input features in memory, so that feature elements arrive at the PE ready for the dot product. One important parallelism enabled by the layout transform is folding.
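As a rough mental model, the layout transform behaves like the sketch below: an address reordering plus a conversion to FP16, padded so the channel count is a multiple of c_vector. The actual in-memory format used by the IP is not shown here; this code is only an illustration under those assumptions.

```python
# Conceptual sketch of the layout transform: address reordering plus a
# datatype conversion so feature elements arrive as groups of c_vector
# FP16 values.
import numpy as np

def layout_transform(image_hwc_u8, c_vector=16):
    """uint8 HWC image -> FP16 feature padded to a multiple of c_vector channels."""
    h, w, c = image_hwc_u8.shape
    c_pad = -c % c_vector                      # zero-padded channels needed
    feat = image_hwc_u8.astype(np.float16)     # datatype conversion for the PE
    feat = np.transpose(feat, (2, 0, 1))       # HWC -> CHW (address reordering)
    feat = np.pad(feat, ((0, c_pad), (0, 0), (0, 0)))
    return feat                                # shape: (padded C, H, W)
```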
Because the dot product spans the input channel dimension of the input feature, the PE is best utilized when the number of input channels is close to the value of c_vector. For image-based tasks, typical inputs have only three channels (Red, Green, Blue), whereas the width and height dimensions are generally much larger. To saturate the PE, the input feature may be reshaped so that the input channel count matches or exceeds the c_vector value. In the conceptual illustration of folding below, the original input feature on the left, X, has a shallow input channel before folding. The size of the dot product in the PE is larger than the number of input channels, so without folding the PEs remain partially idle because there is insufficient input channel data (the underutilized portion of the dot product is shown in gray, as the input is padded with zeros). With folding enabled, the width and height dimensions are folded into the input channel dimension, producing an adjusted input to the PE with an expanded input channel dimension that eliminates as much zero padding as possible. As a result, the folded input feature, X’, has a larger input channel count and gives better utilization of the dot products in the PE.
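A minimal sketch of the shape effect of folding is shown below, assuming a space-to-depth style rearrangement; the exact ordering used by the layout transform may differ, so only the dimension change should be taken from this example.

```python
# Conceptual sketch of folding: an f x f patch of spatial positions is folded
# into the channel dimension so that the channel count approaches c_vector.
import numpy as np

def fold_input(x_chw, fold=2):
    """(C, H, W) -> (C*fold*fold, H/fold, W/fold), folding H and W into C."""
    c, h, w = x_chw.shape
    assert h % fold == 0 and w % fold == 0
    x = x_chw.reshape(c, h // fold, fold, w // fold, fold)
    x = np.transpose(x, (0, 2, 4, 1, 3))      # move the fold factors next to C
    return x.reshape(c * fold * fold, h // fold, w // fold)

# Example: a 3-channel 224x224 input folded by 2 becomes 12 channels of
# 112x112, filling more of a c_vector-wide dot product with real data
# instead of zeros.
```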
For additional information on the layout transform, refer to Parameter: enable_layout_transform and Folding Input. For a variant of the layout transform feature and a comparison with the Full Layout Transform, refer to Transforming Input Data Layout.
The second optimization further capitalizes on FPGA spatial architecture by unrolling computations across the output channel dimension. Rather than computing a single dot product for one output channel, the PE itself is replicated, with each PE calculating the dot product for a distinct output channel. The input feature sequentially passes through each PE, with dot products computed using the respective filters specific to each output channel. Collectively, these PEs form a PE Array. This parallelism introduces a new parameter, k_vector, which is the total number of parallel PEs in the PE Array. For more details about this parameter, refer to Parameter: k_vector.
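Conceptually, one step of the PE Array can be pictured as follows, with each of the k_vector PEs applying its own filter slice to the same c_vector-wide block of the input feature. The K_VECTOR value and function name are illustrative assumptions, not part of the IP interface.

```python
# Conceptual sketch of replicating the PE across output channels: k_vector PEs
# each hold the filter for a different output channel and consume the same
# input feature block.
import numpy as np

K_VECTOR = 8  # illustrative value

def pe_array_step(x_block, w_blocks):
    """x_block: (c_vector,) input slice shared by all PEs.
    w_blocks: (k_vector, c_vector) one filter slice per PE.
    Returns k_vector partial sums, one per output channel."""
    return w_blocks @ x_block   # each row is one PE's dot product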
Given that input features typically have substantial width and height dimensions, a third optimization involves parallelizing computations along the height dimension (that is, dimensions represented as BxCxHxW). Architectures employing parallelism along the height dimension are referred to as multilane architectures.
Multilane architectures partition input features into segments along the height dimension, enabling parallel convolution computation within multiple lanes, significantly increasing throughput. Each lane comprises duplicated auxiliary modules (aux_modules) and k_vector number of PEs, computing convolution on height segments in parallel while sharing filter data among lanes. The segmented input features and filters traverse the 2D PE arrays in a systolic manner, and the computed dot products are aggregated to produce the output feature. The num_lanes parameter governs the quantity of lanes instantiated, theoretically enhancing throughput proportionally.
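The height-wise partitioning can be sketched as follows. This is a conceptual model only: NUM_LANES is an illustrative value, conv_fn stands in for a single-lane convolution, and halo handling at segment boundaries is omitted for brevity.

```python
# Conceptual sketch of a multilane architecture: the input feature is split
# into num_lanes segments along the height dimension, each lane convolves its
# segment with the shared filters, and the lane outputs are reassembled.
import numpy as np

NUM_LANES = 2  # illustrative value

def multilane_conv(x_chw, filters, conv_fn):
    """x_chw: (C, H, W); filters shared by all lanes; conv_fn computes one lane."""
    segments = np.array_split(x_chw, NUM_LANES, axis=1)        # split along H
    lane_outputs = [conv_fn(seg, filters) for seg in segments] # lanes run in parallel in hardware
    return np.concatenate(lane_outputs, axis=1)                # reassemble along H
```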
Although multilane architectures can substantially boost performance, careful consideration is necessary due to increased FPGA resource utilization and memory bandwidth demands. Specifically, DSP resource usage scales directly with num_lanes, and logic gate requirements grow correspondingly. Additionally, DDR bandwidth consumption rises with multilane architecture since more input features are being consumed by the PE arrays in parallel. Also, ML graphs with limited height dimensions or few convolutional layers might not gain performance benefits and could experience unnecessary resource overhead.
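As a rough, unofficial way to reason about this scaling, the number of parallel multiplications, which drives DSP usage, grows with the product of the three parallelism parameters. The sketch below is an assumption made for illustration, not an official resource model for the IP.

```python
# Rough, illustrative estimate (an assumption, not an official resource model):
# parallel multiplications scale with c_vector * k_vector * num_lanes.
def parallel_multiplies(c_vector, k_vector, num_lanes):
    return c_vector * k_vector * num_lanes

# Example: with c_vector=16 and k_vector=16, going from 1 lane to 2 lanes
# doubles the parallel multiplications from 256 to 512.
print(parallel_multiplies(16, 16, 1), parallel_multiplies(16, 16, 2))
```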
Multilane-enabled architectures within the architecture directory are identifiable by filenames ending with _Multilane (for example, AGX7_Performance_Multilane.arch). For comprehensive details on multilane configurations, see Parameter: num_lanes.
The material in this publication will help you to effectively balance performance against resource utilization, experiment with different lane configurations, and utilize the model analyzer tool.