FPGA AI Suite Handbook

ID 863373
Date 11/21/2025
Public

9.3.2.1. How the Lightweight Layout Transform Works

Since data usually arrives at the input feeder from a device in a raster-scan, channels-last format, the shape of the raw input tensor has the spatial dimensions first and the channel dimension last. For instance, for an 8-bit RGB image, the input shape is height by width by channel (HxWxC), where C is 3.

The PE array can perform a dot-product operation on a pair of c_vector-sized input feature and filter vectors at FP16 precision in each cycle. Therefore, the input tensors to the PE array engine are required to have a channel dimension that matches the value of c_vector.

When the lightweight layout transform is enabled, if an input pixel arrives at the input feeder with a channel dimension less than c_vector, the channel dimension must be padded with zeros. For instance, an 8-bit RGB pixel with a c_vector of 16 is padded with 13 zeros.
Figure 22. Input Pixel with Zero Padding


Then, the elements in this pixel are converted to FP16 to match the PE precision. After the conversion, the lightweight layout transform for this input feature is complete, and the feature is queued in an exit pixel buffer, ready to be consumed by the PE.
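The per-pixel steps above (zero-pad the channel dimension to c_vector, then convert to FP16) can be sketched in NumPy. This is an illustrative model only, not the hardware implementation; the function name and a c_vector of 16 are assumptions taken from the RGB example in the text.

```python
import numpy as np

C_VECTOR = 16  # assumed PE dot-product width, matching the example in the text

def transform_pixel(pixel_u8):
    """Zero-pad one pixel's channel dimension to C_VECTOR, then cast to FP16."""
    pixel = np.asarray(pixel_u8, dtype=np.uint8)
    padded = np.zeros(C_VECTOR, dtype=np.uint8)
    padded[:pixel.shape[0]] = pixel       # channel values first, zeros after
    return padded.astype(np.float16)      # convert to match the PE precision

# One 8-bit RGB pixel: the result is a 16-element FP16 vector,
# the 3 channel values followed by 13 zeros.
vec = transform_pixel([255, 128, 0])
```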


While the preceding description covered the lightweight transform for one input pixel at a time, in most scenarios multiple pixels arrive at the input feeder on the data bus in parallel.

In the illustration that follows, three pixels arrive at the input feeder in parallel. Each pixel has three channels of 8-bit RGB values as unsigned integers (U8). Therefore, the width of the input bus is 72 bits = 3 (parallel pixels) * 3 (channels per pixel) * 8 (bits per feature element).
Figure 23. Multiple Pixels In the Lightweight Layout Transform


In general, given P pixels arriving at the input feeder in parallel, where each pixel has a width of C * B bits (C channels at B bits per element), the input data bus width is P * C * B bits.

When these pixels arrive at the input feeder, some graphs require a bias and scale to be applied to the input tensor. The lightweight layout transform can optionally apply a bias and scale to input values as they arrive.

Because the lightweight layout transform neither spends extensive logic resources on folding the spatial dimensions into the channel dimension, nor tracks partial pixel transactions or buffers partial results, it splits the input into individual pixels, creates one of the transform processing pipelines described earlier for each pixel, and then selectively sends the transformed feature vectors downstream to be consumed by the PE.
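A minimal functional model of this split-and-transform behavior is sketched below: one bus transfer is divided into whole pixels, and each pixel is padded and converted independently, mirroring the parallel per-pixel pipelines. The function name and the c_vector of 16 are assumptions.

```python
import numpy as np

C_VECTOR = 16  # assumed PE dot-product width

def transform_bus_word(bus_bytes, channels_per_pixel=3):
    """Split one bus transfer into whole pixels and transform each pixel
    independently -- a sketch of the parallel per-pixel pipelines."""
    data = np.asarray(bus_bytes, dtype=np.uint8)
    # The lightweight transform never handles partial pixels, so the
    # transfer must contain a whole number of pixels.
    assert data.size % channels_per_pixel == 0
    pixels = data.reshape(-1, channels_per_pixel)
    out = np.zeros((pixels.shape[0], C_VECTOR), dtype=np.float16)
    out[:, :channels_per_pixel] = pixels   # remaining lanes stay zero-padded
    return out

# A 72-bit transfer carrying 3 RGB pixels of 3 bytes each.
word = [10, 20, 30, 40, 50, 60, 70, 80, 90]
vecs = transform_bus_word(word)   # shape (3, 16), one FP16 vector per pixel
```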

Functional Restrictions

The lightweight layout transform has the following functional restrictions:
  • Tensors with channel dimensions greater than c_vector cannot be handled by the lightweight layout transform.
  • Input bus width must be a multiple of the pixel width.
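The two restrictions above can be expressed as a simple compatibility check. The function name, parameter defaults, and c_vector value are illustrative, not part of the product's API.

```python
def check_lightweight_supported(channels, bus_width_bits,
                                bits_per_element=8, c_vector=16):
    """Sketch of the lightweight layout transform's functional restrictions:
    channels must fit within c_vector, and the bus must carry whole pixels."""
    pixel_width = channels * bits_per_element
    if channels > c_vector:
        return False          # channel dimension exceeds c_vector
    if bus_width_bits % pixel_width != 0:
        return False          # bus width is not a multiple of the pixel width
    return True

check_lightweight_supported(3, 72)    # RGB pixels on a 72-bit bus: supported
check_lightweight_supported(32, 72)   # 32 channels > c_vector: not supported
```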

Comparison to the Full Layout Transform

Resource savings in the lightweight transform, compared to the full layout transform, come mainly from restricting the input bus width to a multiple of the pixel width and from not supporting folding.

Conversely, the full layout transform has no restriction on the input bus width aside from being a multiple of the element width – i.e., a multiple of 8 bits for U8 inputs, or a multiple of 16 bits for FP16/U16 inputs. Because a transfer can then end in the middle of a pixel, the full layout transform must keep track of state information between transfers and store partial results.

The following table compares the estimated performance (using dla_compiler at 500 MHz) and area for resnet-50-tf between AGX7_Performance_LayoutTransform.arch, which uses the hardware full layout transform, and AGX7_LightweightLayoutTransform.arch. The arch files are slightly modified so that unused auxiliary modules are turned off and c_vector, k_vector, and the stream buffer depth are aligned.
Table 25.  Input Layout Transform Resource Utilization and Throughput Comparison

| Input Layout Transform | ALMs  | DSPs | ALUTs  | M20K | Min Avg DDR | IP Throughput |
|------------------------|-------|------|--------|------|-------------|---------------|
| Lightweight            | 60314 | 586  | 77506  | 2124 | 7547 MB/s   | 149 fps       |
| Full                   | 96730 | 594  | 139543 | 2853 | 8480 MB/s   | 171 fps       |