9.3.2.1. How the Lightweight Layout Transform Works
Because data usually arrives at the input feeder from a device in raster-scan, channels-last order, the raw input tensors typically have their spatial dimensions first and the channel dimension last. For instance, for an 8-bit RGB image, the input shape is height by width by channel (HxWxC), where C is 3.
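To make the channels-last raster-scan order concrete, the sketch below computes where element (h, w, c) of an HxWxC image falls in the incoming stream. It is only an illustration of the HWC layout; the function name `hwc_offset` is hypothetical and not part of any IP interface.

```python
def hwc_offset(h: int, w: int, c: int, width: int, channels: int) -> int:
    """Flat position of element (h, w, c) in a raster-scan, channels-last (HxWxC) stream."""
    return (h * width + w) * channels + c

# 8-bit RGB image, 4 pixels wide: the 3 channels of one pixel are adjacent in the stream.
W, C = 4, 3
print([hwc_offset(0, 1, c, W, C) for c in range(C)])  # → [3, 4, 5]
print(hwc_offset(1, 0, 0, W, C))  # → 12 (first pixel of the second row)
```

Because all channels of a pixel arrive together, a transform can handle one pixel at a time without waiting for other pixels.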
In each cycle, the PE array can compute a dot product between a c_vector-sized input feature vector and a c_vector-sized filter vector at FP16 precision. Therefore, the input tensors to the PE array engine are required to have a channel dimension that matches the value of c_vector.
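A minimal sketch of what a per-pixel layout transform could look like follows. It assumes that pixels with fewer than c_vector channels are zero-padded up to c_vector and that values are rounded to IEEE 754 half precision; the padding value, the function names, and the example c_vector of 16 are assumptions for illustration, not the documented hardware behavior.

```python
import struct
from typing import List


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def transform_pixel(channels: List[int], c_vector: int) -> List[float]:
    """Hypothetical per-pixel transform: convert one pixel's C channel values
    to FP16 and zero-pad the channel dimension up to c_vector (assumed padding)."""
    if len(channels) > c_vector:
        # Mirrors the stated restriction: C > c_vector is not supported.
        raise ValueError(f"C={len(channels)} exceeds c_vector={c_vector}")
    vec = [to_fp16(float(v)) for v in channels]
    vec.extend([0.0] * (c_vector - len(channels)))
    return vec


# One RGB pixel (C = 3) prepared for an assumed c_vector of 16.
print(transform_pixel([255, 128, 0], c_vector=16))
```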
The preceding description covers the lightweight transform for an input of one pixel at a time. In most scenarios, however, multiple pixels arrive at the input feeder in parallel on the data bus.
Given $N$ pixels arriving at the input feeder in parallel, each with a width of $P$ bits, the input data bus width is $N \times P$ bits.
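The relationship "bus width = pixels per transfer × pixel width" can be illustrated by slicing one bus word into its pixels and each pixel into its channels. This sketch assumes pixel 0 and channel 0 occupy the least-significant bits and uses an assumed 96-bit bus; the actual bit ordering on the bus is not specified here, and both helper names are hypothetical.

```python
from typing import List


def split_bus_word(word: int, bus_bits: int, pixel_bits: int) -> List[int]:
    """Split one bus transfer into its bus_bits // pixel_bits pixels.
    Assumes pixel 0 occupies the least-significant bits (hypothetical ordering)."""
    if bus_bits % pixel_bits != 0:
        raise ValueError("bus width must be a multiple of the pixel width")
    n = bus_bits // pixel_bits  # pixels per transfer
    mask = (1 << pixel_bits) - 1
    return [(word >> (i * pixel_bits)) & mask for i in range(n)]


def unpack_pixel(pixel: int, channels: int, element_bits: int) -> List[int]:
    """Unpack a pixel into channel values, channel 0 in the low bits (assumed)."""
    mask = (1 << element_bits) - 1
    return [(pixel >> (c * element_bits)) & mask for c in range(channels)]


# 8-bit RGB pixels (24 bits each) on an assumed 96-bit bus: 4 pixels per transfer.
pixels = [(10, 20, 30), (40, 50, 60), (70, 80, 90), (100, 110, 120)]
word = 0
for i, (r, g, b) in enumerate(pixels):
    word |= (r | (g << 8) | (b << 16)) << (i * 24)
print([unpack_pixel(p, 3, 8) for p in split_bus_word(word, 96, 24)])
# → [[10, 20, 30], [40, 50, 60], [70, 80, 90], [100, 110, 120]]
```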
When these pixels arrive at the input feeder, some graphs require a bias and scale to be applied to the input tensor. The lightweight layout transform can optionally apply a bias and scale to input values as they arrive.
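The optional bias and scale step can be sketched as a per-channel affine operation. The order of operations used here (multiply by scale, then add bias) and the per-channel parameter shape are assumptions; the text only states that a bias and scale can be applied.

```python
from typing import List, Sequence


def apply_scale_bias(values: Sequence[float], scale: Sequence[float],
                     bias: Sequence[float]) -> List[float]:
    """Per-channel preprocessing y = x * scale + bias (operation order is an assumption)."""
    if not (len(values) == len(scale) == len(bias)):
        raise ValueError("values, scale and bias need one entry per channel")
    return [x * s + b for x, s, b in zip(values, scale, bias)]


# Map 8-bit values to roughly [-1, 1) with scale = 1/128 and bias = -1.
print(apply_scale_bias([0, 128, 255], [1 / 128] * 3, [-1.0] * 3))  # → [-1.0, 0.0, 0.9921875]
```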
The lightweight layout transform does not spend extensive logic resources folding the spatial dimensions into the channel dimension, nor does it track partial pixel transactions or buffer partial results. Instead, it splits each input transfer into individual pixels, instantiates one copy of the per-pixel transform pipeline described earlier for each pixel, and then selectively sends the transformed feature vectors downstream to be consumed by the PE array.
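The following sketch models that structure in software: each bus transfer is cut into one lane per pixel, every lane runs the same per-pixel pipeline, and only selected lanes are forwarded. Interpreting "selectively" as a per-pixel valid mask is an assumption, as are the bit ordering, the scale-then-bias order, zero padding, and all names (`pixel_lane`, `lightweight_transform`). Note that no state is carried from one transfer to the next.

```python
import struct
from typing import Iterable, Iterator, List, Sequence, Tuple


def to_fp16(x: float) -> float:
    # Round to the nearest IEEE 754 half-precision value.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def pixel_lane(pixel: int, channels: int, element_bits: int, c_vector: int,
               scale: Sequence[float], bias: Sequence[float]) -> List[float]:
    """One per-pixel pipeline: unpack channels, apply scale/bias,
    convert to FP16 and zero-pad to c_vector (all assumed details)."""
    mask = (1 << element_bits) - 1
    vals = [(pixel >> (c * element_bits)) & mask for c in range(channels)]
    vals = [to_fp16(v * s + b) for v, s, b in zip(vals, scale, bias)]
    return vals + [0.0] * (c_vector - channels)


def lightweight_transform(transfers: Iterable[Tuple[int, int]], bus_bits: int,
                          channels: int, element_bits: int, c_vector: int,
                          scale: Sequence[float],
                          bias: Sequence[float]) -> Iterator[List[float]]:
    """Each transfer is (word, valid_mask); bit i of valid_mask marks pixel i as valid."""
    pixel_bits = channels * element_bits
    if channels > c_vector or bus_bits % pixel_bits != 0:
        raise ValueError("configuration not supported by the lightweight transform")
    n = bus_bits // pixel_bits  # one pipeline lane per pixel in the transfer
    pmask = (1 << pixel_bits) - 1
    for word, valid in transfers:  # no state survives between transfers
        for i in range(n):  # in hardware these lanes run side by side
            if (valid >> i) & 1:  # forward only the selected (valid) pixels
                pixel = (word >> (i * pixel_bits)) & pmask
                yield pixel_lane(pixel, channels, element_bits, c_vector, scale, bias)


# Two transfers on an assumed 48-bit bus (two RGB pixels each); in the second
# transfer only pixel 0 is marked valid.
ones, zeros = [1.0] * 3, [0.0] * 3
word0 = (1 | (2 << 8) | (3 << 16)) | ((4 | (5 << 8) | (6 << 16)) << 24)
word1 = 7 | (8 << 8) | (9 << 16)
out = list(lightweight_transform([(word0, 0b11), (word1, 0b01)], 48, 3, 8, 4, ones, zeros))
print(out)
```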
Functional Restrictions
- Tensors with channel dimensions greater than c_vector cannot be handled by the lightweight layout transform.
- Input bus width must be a multiple of the pixel width.
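The two restrictions above can be checked mechanically for a given configuration. This is a hypothetical helper, not a tool shipped with the IP; the c_vector value of 16 in the example is an assumption.

```python
from typing import List


def check_lightweight_config(channels: int, element_bits: int,
                             bus_bits: int, c_vector: int) -> List[str]:
    """Return the lightweight-transform restrictions that a configuration
    violates; an empty list means both restrictions are satisfied."""
    problems = []
    if channels > c_vector:
        problems.append(f"channel count {channels} exceeds c_vector={c_vector}")
    pixel_bits = channels * element_bits
    if bus_bits % pixel_bits != 0:
        problems.append(f"bus width {bus_bits} is not a multiple of pixel width {pixel_bits}")
    return problems


# 8-bit RGB pixels are 24 bits wide.
print(check_lightweight_config(3, 8, 96, 16))   # → []
print(check_lightweight_config(3, 8, 128, 16))  # → ['bus width 128 is not a multiple of pixel width 24']
```

A 128-bit bus carrying 8-bit RGB pixels is a typical case that the lightweight transform cannot accept.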
Comparison to the Full Layout Transform
Compared to the full layout transform, the resource savings of the lightweight transform come mainly from restricting the input bus width to a multiple of the pixel width and from not supporting folding.
Conversely, the full layout transform places no restriction on the input bus width other than requiring it to be a multiple of the element width (that is, a multiple of 8 bits for U8 inputs, or a multiple of 16 bits for FP16/U16 inputs). Because a single pixel can then be split across two bus transfers, the full layout transform must keep track of state information between transfers and store partial results.
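The sketch below illustrates why that state is needed. It models only U8 inputs and reassembles whole pixels from bus transfers whose width is not a multiple of the pixel width, buffering leftover bytes from one transfer until the next. The class name and the 128-bit bus in the example are assumptions; this is not a description of the actual full layout transform logic.

```python
from typing import List


class PixelReassembler:
    """Hypothetical model of the state a full layout transform must keep:
    bytes of a pixel left over from one transfer are held until the next."""

    def __init__(self, pixel_bytes: int):
        self.pixel_bytes = pixel_bytes
        self.partial = bytearray()  # carried across transfers

    def push(self, transfer: bytes) -> List[bytes]:
        """Accept one transfer and return every pixel it completes."""
        self.partial.extend(transfer)
        cut = (len(self.partial) // self.pixel_bytes) * self.pixel_bytes
        pixels = [bytes(self.partial[i:i + self.pixel_bytes])
                  for i in range(0, cut, self.pixel_bytes)]
        del self.partial[:cut]  # keep only the incomplete tail
        return pixels


# 8-bit RGB (3-byte pixels) on an assumed 16-byte (128-bit) bus.
stream = bytes(range(48))  # 16 pixels' worth of data
r = PixelReassembler(pixel_bytes=3)
per_transfer = [len(r.push(stream[i:i + 16])) for i in range(0, 48, 16)]
print(per_transfer)  # → [5, 5, 6]
```

The varying number of completed pixels per transfer, and the leftover bytes between transfers, are exactly what the lightweight transform avoids by requiring a bus width that is a multiple of the pixel width.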
| Input Layout Transform | ALMs | DSPs | ALUTs | M20K | Min Avg DDR Bandwidth | IP Throughput |
|---|---|---|---|---|---|---|
| Lightweight | 60314 | 586 | 77506 | 2124 | 7547 MB/s | 149 fps |
| Full | 96730 | 594 | 139543 | 2853 | 8480 MB/s | 171 fps |
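For a quick reading of the table, the relative difference of the lightweight transform against the full transform can be computed directly from the values above. The script only does arithmetic on the table's numbers and adds no new measurements.

```python
# Values copied from the table above.
lightweight = {"ALMs": 60314, "DSPs": 586, "ALUTs": 77506, "M20K": 2124,
               "Min Avg DDR (MB/s)": 7547, "IP Throughput (fps)": 149}
full = {"ALMs": 96730, "DSPs": 594, "ALUTs": 139543, "M20K": 2853,
        "Min Avg DDR (MB/s)": 8480, "IP Throughput (fps)": 171}

# Percentage change of the lightweight transform relative to the full transform.
change = {k: (lightweight[k] - full[k]) / full[k] * 100 for k in lightweight}
for k, pct in change.items():
    print(f"{k}: {pct:+.1f}%")
```

By this arithmetic, the lightweight transform uses about 37.6% fewer ALMs, 44.5% fewer ALUTs, and 25.6% fewer M20K blocks, at a throughput about 12.9% lower than the full transform in this configuration.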