9.3.2.1. How the Lightweight Layout Transform Works

FPGA AI Suite Handbook

ID 863373

Date 11/21/2025

Data usually arrives at the input feeder from a device in raster-scan order, so the raw input tensors typically have the spatial dimensions first and the channel dimension last. For instance, the input shape of an 8-bit RGB image is height by width by channel (HxWxC), where C is 3.

In each cycle, the PE array computes a dot product between a c_vector-sized input feature vector and a c_vector-sized filter vector at FP16 precision. Therefore, the input tensors to the PE array must have a channel dimension that matches the value of c_vector.

When the lightweight layout transform is enabled and an input pixel arrives at the input feeder with a channel dimension that is less than c_vector, the channel dimension is padded with zeros up to c_vector. For instance, a 3-channel 8-bit RGB pixel with a c_vector of 16 is padded with 13 zeros.

Figure 22. Input Pixel with Zero Padding

Next, the elements of the padded pixel are converted to FP16 to match the PE precision. After the conversion, the lightweight layout transform for this input feature is complete, and the feature is queued in an exit pixel buffer, ready to be consumed by the PE array.
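The padding and conversion steps above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware implementation: the function name `pad_and_convert` is hypothetical, and the FP16 encoding relies on the half-precision `e` format of the Python standard `struct` module.

```python
import struct


def pad_and_convert(pixel, c_vector):
    """Model one pixel passing through the lightweight layout transform.

    pixel    -- list of unsigned 8-bit channel values, e.g. [R, G, B]
    c_vector -- channel vector size expected by the PE array
    Returns a list of c_vector raw FP16 bit patterns (16-bit integers).
    Hypothetical helper for illustration only.
    """
    if len(pixel) > c_vector:
        raise ValueError("channel dimension exceeds c_vector")

    # Step 1: zero-pad the channel dimension up to c_vector.
    padded = list(pixel) + [0] * (c_vector - len(pixel))

    # Step 2: convert each element to FP16 and expose its raw 16-bit encoding.
    return [struct.unpack("<H", struct.pack("<e", float(v)))[0] for v in padded]


# A 3-channel RGB pixel with c_vector = 16 gains 13 zero elements.
features = pad_and_convert([255, 0, 128], c_vector=16)
print(len(features), [hex(f) for f in features[:3]])
```

Each returned element is the 16-bit pattern that an FP16 datapath would carry for that channel; the padded channels encode as zero.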

The preceding description covers the lightweight layout transform for one input pixel at a time. In most scenarios, however, multiple pixels arrive at the input feeder on the data bus in parallel.

In the illustration that follows, three pixels arrive at the input feeder in parallel. Each pixel has three channels of 8-bit RGB values stored as unsigned integers (U8). Therefore, the width of the input bus is 72 bits = 3 (parallel pixels) * 3 (channels per pixel) * 8 (bits per feature element).

Figure 23. Multiple Pixels In the Lightweight Layout Transform

In general, if $N$ pixels arrive at the input feeder in parallel and each pixel has a width of $W_{pixel}$ bits, the input data bus width is $W_{bus} = N \times W_{pixel}$ .
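As a rough illustration of this relationship, the following sketch computes the bus width and splits one bus word into its parallel pixels. The function names are hypothetical, and the bit ordering (pixel 0, channel 0 in the least significant bits) is an assumption made for the example; the actual ordering on the data bus is not stated here.

```python
def bus_width_bits(num_pixels, channels, bits_per_element):
    """W_bus = N * W_pixel, where W_pixel = channels * bits_per_element."""
    return num_pixels * channels * bits_per_element


def split_bus_word(word, num_pixels, channels, bits_per_element):
    """Split an integer bus word into num_pixels lists of channel values.

    ASSUMPTION: element 0 of pixel 0 sits in the least significant bits.
    """
    mask = (1 << bits_per_element) - 1
    pixels = []
    for p in range(num_pixels):
        pixel = []
        for c in range(channels):
            shift = (p * channels + c) * bits_per_element
            pixel.append((word >> shift) & mask)
        pixels.append(pixel)
    return pixels


# Three parallel 8-bit RGB pixels give a 72-bit bus.
width = bus_width_bits(num_pixels=3, channels=3, bits_per_element=8)
word = int.from_bytes(bytes([1, 2, 3, 4, 5, 6, 7, 8, 9]), "little")
print(width, split_bus_word(word, 3, 3, 8))
```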

When these pixels arrive at the input feeder, some graphs require a bias and scale to be applied to the input tensor. The lightweight layout transform can optionally apply a bias and scale to input values as they arrive.

The lightweight layout transform does not spend extensive logic resources on folding the spatial dimensions into the channel dimension, and it does not track partial pixel transactions or buffer partial results. Instead, it splits the input into $N$ individual pixels, instantiates one transform processing pipeline (as described earlier) for each pixel, and then selectively sends the transformed feature vectors downstream to be consumed by the PE array.
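The per-pixel pipelines can be modeled in software along the following lines. This is a sketch under assumptions, not a description of the hardware: the function names are hypothetical, the order of operations `(value + bias) * scale` is assumed for illustration, bias and scale are applied only to the real channels so that the padding stays zero, and the selective downstream send is not modeled.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest FP16 value via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def transform_pixel(pixel, c_vector, bias=0.0, scale=1.0):
    """One pipeline: optional bias/scale, zero padding, FP16 conversion.

    ASSUMPTION: bias is added before scale is multiplied; the actual
    hardware ordering is not specified in this section.
    """
    if len(pixel) > c_vector:
        raise ValueError("channel dimension exceeds c_vector")
    adjusted = [(v + bias) * scale for v in pixel]
    padded = adjusted + [0.0] * (c_vector - len(pixel))
    return [to_fp16(v) for v in padded]


def transform_beat(pixels, c_vector, bias=0.0, scale=1.0):
    """Run one independent pipeline per parallel pixel in a bus transfer."""
    return [transform_pixel(p, c_vector, bias, scale) for p in pixels]


# Three parallel RGB pixels, each expanded to a 16-element FP16 feature vector.
vectors = transform_beat([[1, 2, 3], [4, 5, 6], [7, 8, 9]], c_vector=16)
print(len(vectors), len(vectors[0]), vectors[0][:4])
```

Because each pixel is processed independently, no state is carried from one bus transfer to the next, which is the property the lightweight transform relies on.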

Functional Restrictions

The lightweight layout transform has the following functional restrictions:

Tensors with channel dimensions greater than c_vector cannot be handled by the lightweight layout transform.
Input bus width must be a multiple of the pixel width.
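The two restrictions above can be expressed as a simple pre-check. The sketch below is a hypothetical helper, not part of the FPGA AI Suite tools; it assumes that the pixel width equals the number of channels multiplied by the element width in bits.

```python
def check_lightweight_restrictions(channels, c_vector, bus_width_bits, bits_per_element):
    """Return a list of violated restrictions (an empty list means compatible).

    Hypothetical helper; pixel width is assumed to be
    channels * bits_per_element.
    """
    problems = []
    pixel_width = channels * bits_per_element

    # Restriction 1: the channel dimension must not exceed c_vector.
    if channels > c_vector:
        problems.append("channel dimension is greater than c_vector")

    # Restriction 2: the bus must carry a whole number of pixels.
    if bus_width_bits % pixel_width != 0:
        problems.append("input bus width is not a multiple of the pixel width")

    return problems


print(check_lightweight_restrictions(3, 16, 72, 8))  # 72-bit bus, RGB U8
print(check_lightweight_restrictions(3, 16, 64, 8))  # 64 is not a multiple of 24
```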

Comparison to the Full Layout Transform

The lightweight layout transform saves resources relative to the full layout transform mainly because it restricts the input bus width to a multiple of the pixel width and does not support folding.

Conversely, the full layout transform places no restriction on the input bus width other than that it must be a multiple of the element width (that is, a multiple of 8 bits for U8 inputs, or a multiple of 16 bits for FP16/U16 inputs). As a result, a pixel can straddle two transfers, so the full layout transform must keep track of state information between transfers and store partial results.

The following table compares the estimated performance (using dla_compiler at 500 MHz) and the area for resnet-50-tf between AGX7_Performance_LayoutTransform.arch, which uses the hardware full layout transform, and AGX7_LightweightLayoutTransform.arch. The arch files are slightly modified so that unused auxiliary modules are turned off and c_vector, k_vector, and stream buffer depth are aligned.

Table 25. Input Layout Transform Resource Utilization and Throughput Comparison
Input Layout Transform	ALMs	DSPs	ALUTs	M20K	Min Avg DDR	IP Throughput
Lightweight	60314	586	77506	2124	7547MB/s	149fps
Full	96730	594	139543	2853	8480MB/s	171fps
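To read the table as relative differences, the figures can be compared with a few lines of arithmetic. The sketch below only restates the values from Table 25; the dictionary and function names are hypothetical.

```python
# Values copied from Table 25 (DDR in MB/s, throughput in fps).
LIGHTWEIGHT = {"ALMs": 60314, "DSPs": 586, "ALUTs": 77506,
               "M20K": 2124, "DDR": 7547, "fps": 149}
FULL = {"ALMs": 96730, "DSPs": 594, "ALUTs": 139543,
        "M20K": 2853, "DDR": 8480, "fps": 171}


def relative_change(light, full):
    """Fractional change of the lightweight figure relative to the full one."""
    return {key: (light[key] - full[key]) / full[key] for key in full}


changes = relative_change(LIGHTWEIGHT, FULL)
for key, value in changes.items():
    print(f"{key}: {value * 100:+.1f}%")
```

For this configuration, the lightweight transform uses roughly 38% fewer ALMs than the full layout transform, while the estimated IP throughput is roughly 13% lower.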
