3.2.2.1. FP16 Supported Precision Formats

Intel® Agilex™ Variable Precision DSP Blocks User Guide

ID 683037

Date 11/17/2022

Public

The FP16 half-precision floating-point arithmetic functions support the following formats:

Flushed - use IEEE-754 half-precision format (binary16) for multiplier inputs and FP16 multiplication/addition/subtraction operations.
Extended - use IEEE-754 half-precision format (binary16) for multiplier inputs. Use extended format for FP16 multiplication/addition/subtraction operations.
Bfloat16 - multiplier inputs can be configured to accept either the 16-bit bfloat16 format or the 19-bit extended bfloat16+ format. Use extended format for FP16 multiplication/addition/subtraction operations.
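
As an illustration of how the same input bits mean different things in each format, the following minimal Python sketch decodes a bit pattern as binary16 (1.5.10), bfloat16 (1.8.7), or bfloat16+ (1.8.10). The function names are hypothetical. It assumes that bfloat16 and bfloat16+ use the same exponent bias as IEEE-754 single precision, so that zero-padding the mantissa yields an equivalent binary32 value. It models bit-pattern interpretation only, not the behavior of the DSP block (for example, it does not flush subnormals).

```python
import struct


def decode_binary16(bits: int) -> float:
    """Interpret a 16-bit pattern as IEEE-754 binary16 (1.5.10)."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def decode_bfloat16(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16 (1.8.7).

    Assumption: bfloat16 is the upper 16 bits of a binary32 value,
    so padding 16 zero bits on the right gives the same number.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def decode_bfloat16_plus(bits: int) -> float:
    """Interpret a 19-bit pattern as bfloat16+ (1.8.10).

    Assumption: same reasoning as bfloat16, padding 13 zero bits.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0x7FFFF) << 13))[0]
```

For example, the pattern 0x3C00 decodes to 1.0 as binary16 but to 0.0078125 (2^-7) as bfloat16, because the exponent field is 5 bits wide in one format and 8 bits wide in the other.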

The following table shows the differences between the formats:

Table 18. Differences between Flushed, Extended, and Bfloat16 Formats
Feature	Flushed	Extended	Bfloat16/Bfloat16+
Input format (sign.exponent.mantissa)	1.5.10	1.5.10	1.8.7 or 1.8.10 (Bfloat16+)
FP16 operation format (sign.exponent.mantissa)	1.5.10	1.8.10	1.8.10
Input width	16 bits	16 bits	16 bits, or 19 bits (Bfloat16+)
Minimum representable exponent	5'h01 - 5'h0f = -14	8'h01 - 8'h7f = -126	8'h01 - 8'h7f = -126
FP16 Subnormal	No support for subnormal. Subnormal result is flushed to zero.	Subnormal results can be represented as normal numbers	No support for subnormal. Subnormal result is flushed to zero.
Exception flags	Overflow, underflow, inexact, and invalid	Infinite, zero, inexact, and invalid	Overflow, underflow, inexact, and invalid
Invalid flag behavior	Asserted when there is an ill-defined operation	Asserted when there is an ill-defined operation or a qNaN input	Asserted when there is an ill-defined operation
Rounding	Round to nearest even (RNE)	RNE: (1) if both FP16 operands are normal numbers; (2) if one of the FP16 operands is a subnormal number and the mantissa product is ≥ 1; (3) if one of the FP16 operands is a subnormal number and the mantissa product = “0.1111111111|1xxxxxxxxx” when using adder/subtractor operations. Round to zero (RZ): (1) if both FP16 operands are subnormal numbers; (2) if one of the FP16 operands is a subnormal number and the mantissa product is ≤ 1.	RZ
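
Several rows of the table can be checked with a few lines of arithmetic. The Python sketch below uses hypothetical helper names to split a word into its sign.exponent.mantissa fields, derive the minimum representable exponent from the exponent width, flush a subnormal encoding to zero, and contrast RNE with RZ when low-order bits are discarded. It is a software model under stated assumptions, not a description of the DSP block's internal implementation. In particular, keeping the sign bit when flushing is an assumption, and the rounding helper ignores mantissa overflow into the exponent.

```python
def split_fields(bits, exp_bits, man_bits):
    """Return (sign, biased_exponent, mantissa) for a sign.exponent.mantissa word."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return sign, exponent, mantissa


def min_normal_exponent(exp_bits):
    """Smallest unbiased exponent of a normal number: biased value 1 minus the bias."""
    bias = (1 << (exp_bits - 1)) - 1  # 15 for 5 bits, 127 for 8 bits
    return 1 - bias


def flush_subnormal(bits, exp_bits, man_bits):
    """Replace a subnormal encoding with zero; other encodings pass through.

    Assumption: the sign bit is preserved on flush (not stated in the source).
    """
    sign, exponent, mantissa = split_fields(bits, exp_bits, man_bits)
    if exponent == 0 and mantissa != 0:
        return sign << (exp_bits + man_bits)
    return bits


def round_shift(value, drop_bits, mode):
    """Discard the low drop_bits of an unsigned magnitude using RNE or RZ."""
    if mode not in ("RNE", "RZ"):
        raise ValueError("mode must be 'RNE' or 'RZ'")
    kept = value >> drop_bits
    if mode == "RZ" or drop_bits == 0:
        return kept  # round to zero is plain truncation of the magnitude
    remainder = value & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    # Round up above the halfway point, or at the halfway point when kept is odd.
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    return kept
```

With these helpers, min_normal_exponent(5) gives -14 and min_normal_exponent(8) gives -126, matching the "Minimum representable exponent" row for the 5-bit and 8-bit exponent fields.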