FPGA AI Suite Handbook

ID 863373
Date 11/21/2025
Public

9.4.2. Improving Layer Accuracy by using Mixed Precision

For ML tasks that are sensitive to precision, using a lower precision saves area but may reduce inference accuracy. The mixed precision feature allows designated layers in the ML graph to run at a higher precision to achieve better accuracy.

Figure 25. Conversion to high-precision block floating point

The following diagram illustrates the conversion from floating point (fp16) to high-precision block floating point (in this example, 2 x INT9-BFP).

Because the block alignment step of the block floating point conversion uses a mantissa width larger than the fp16 mantissa (the 2 x INT9-BFP representation in this example provides 17 mantissa bits, versus the 11 significand bits of fp16), there is little to no loss of precision.

The only situation in which the high-precision blocked values lose mantissa precision relative to the fp16 inputs occurs when values of very different magnitude (that is, with very different exponents) are blocked together. In this situation, a large bit shift is required to block-align the mantissas, which can cause some low-precision bits of the smaller values in the block to be lost. The seventh blocked value in the diagram illustrates this case.
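
The block alignment behavior can be sketched in a few lines of Python. This is a minimal illustration, not the FPGA AI Suite conversion path; the block size, rounding mode, lack of overflow handling, and the signed INT9 mantissa format are assumptions made for the example.

    # Minimal sketch of blocking fp16-like values to a shared exponent.
    # Illustrative assumptions: signed INT9 mantissas, round-to-nearest,
    # no saturation/clamping logic.
    import math

    MANTISSA_BITS = 9  # INT9-BFP: 1 sign bit + 8 magnitude bits

    def block_quantize(values):
        """Return (shared exponent, integer mantissas) for one block of values."""
        max_val = max(abs(v) for v in values)
        if max_val == 0.0:
            return 0, [0] * len(values)
        # The largest-magnitude value sets the shared block exponent.
        block_exp = math.frexp(max_val)[1]               # max_val < 2**block_exp
        scale = 2.0 ** (block_exp - (MANTISSA_BITS - 1))
        # Values much smaller than the block maximum need a large right shift,
        # so their low-order mantissa bits are rounded away (the case described above).
        mantissas = [int(round(v / scale)) for v in values]
        return block_exp, mantissas

    def dequantize(block_exp, mantissas):
        scale = 2.0 ** (block_exp - (MANTISSA_BITS - 1))
        return [m * scale for m in mantissas]

    # The last value is about 1500x smaller than the block maximum and rounds to 0.
    block = [1.5, -0.875, 1.25, 0.0009765625]
    exp, mants = block_quantize(block)
    print(mants, dequantize(exp, mants))                 # [192, -112, 160, 0] ...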

Using the high-precision block floating point numerical decomposition, a PE array parameterized to handle INT9-BFP can perform convolutions at INT17-BFP precision on select layers. The table below summarizes what "high-precision BFP" entails for different FPGA AI Suite IP arch_precision parameter values.

Table 28. High-Precision BFP vs. Default Precision BFP

arch_precision    Block floating point    High-precision BFP
FP11              INT7-BFP                INT13-BFP (not supported)
FP12AGX           INT8-BFP                INT15-BFP
FP13AGX           INT9-BFP                INT17-BFP
FP16              INT12-BFP               INT23-BFP (not supported)

A high-precision convolution layer has 4x the computational cost of a default-precision convolution layer. With the high-precision BFP decomposition, both features and filters are represented as the sum of two terms. The resulting feature-filter product of sums has four terms.

The computational cost can be reduced by using high-precision BFP for only the features, leaving the filters at default precision. Such a layer has 2x the computational cost of a default-precision layer.
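
The cost figures above follow directly from the two-term decomposition. The sketch below shows the arithmetic; it is not the PE array implementation, and the particular split into a signed high part plus an unsigned 8-bit low part (so that INT9 plus 8 bits covers a 17-bit mantissa) is an assumption made for illustration.

    # Sketch of the high-precision BFP product decomposition (illustrative split).
    LOW_BITS = 8

    def split(mantissa):
        """Decompose an integer mantissa so that mantissa == high * 2**LOW_BITS + low."""
        low = mantissa & ((1 << LOW_BITS) - 1)       # unsigned low part
        high = (mantissa - low) >> LOW_BITS          # signed high part
        return high, low

    def product_both_decomposed(feature, filt):
        """Feature and filter both decomposed: four partial products (4x cost)."""
        f_hi, f_lo = split(feature)
        w_hi, w_lo = split(filt)
        return ((f_hi * w_hi) << (2 * LOW_BITS)) \
             + ((f_hi * w_lo + f_lo * w_hi) << LOW_BITS) \
             + (f_lo * w_lo)

    def product_feature_decomposed(feature, filt):
        """Only the feature decomposed: two partial products (2x cost)."""
        f_hi, f_lo = split(feature)
        return ((f_hi * filt) << LOW_BITS) + (f_lo * filt)

    feature, filt = 40000, -123     # wide feature mantissa, default-precision filter
    assert product_both_decomposed(feature, filt) == feature * filt
    assert product_feature_decomposed(feature, filt) == feature * filt

Because each operand piece is narrow, every partial product is a narrow multiplication of the kind the default-precision PEs already perform, which is consistent with the 4x and 2x cost multiples above.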