Intel® Agilex™ Variable Precision DSP Blocks User Guide

ID 683037
Date 11/17/2022
Public


3.2.2.1. FP16 Supported Precision Formats

The FP16 half-precision floating-point arithmetic functions support the following formats:
  • Flushed: uses the IEEE 754 half-precision format (binary16) for the multiplier inputs and for FP16 multiplication, addition, and subtraction operations.
  • Extended: uses the IEEE 754 half-precision format (binary16) for the multiplier inputs and an extended format for FP16 multiplication, addition, and subtraction operations.
  • Bfloat16: the multiplier inputs can be configured to accept either the 16-bit bfloat16 format or the 19-bit extended bfloat16+ format. FP16 multiplication, addition, and subtraction operations use the extended format.
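As a rough illustration of the input formats listed above, the following Python sketch produces the raw bit patterns and splits them into sign, exponent, and mantissa fields. The helper names are hypothetical and are not part of any Intel tool. The sketch assumes that bfloat16 is the upper 16 bits of an IEEE 754 binary32 word, and that bfloat16+ is the upper 19 bits (the 1.8.10 layout listed in Table 18). It truncates instead of rounding, purely for simplicity.

```python
import struct

def binary16_bits(x: float) -> int:
    """Encode x as an IEEE 754 binary16 word (struct format 'e')."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def binary32_bits(x: float) -> int:
    """Encode x as an IEEE 754 binary32 word (struct format 'f')."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of binary32: 1 sign, 8 exponent, 7 mantissa."""
    return binary32_bits(x) >> 16

def bfloat16_plus_bits(x: float) -> int:
    """ASSUMPTION: keep the top 19 bits of binary32 (1 sign, 8 exponent, 10 mantissa)."""
    return binary32_bits(x) >> 13

def split_fields(word: int, exp_bits: int, man_bits: int):
    """Split a raw word into (sign, biased exponent, mantissa) fields."""
    mantissa = word & ((1 << man_bits) - 1)
    exponent = (word >> man_bits) & ((1 << exp_bits) - 1)
    sign = word >> (man_bits + exp_bits)
    return sign, exponent, mantissa
```

For example, `split_fields(binary16_bits(1.0), 5, 10)` yields `(0, 15, 0)`, and `split_fields(bfloat16_bits(1.0), 8, 7)` yields `(0, 127, 0)`. The biased exponent differs only because the two layouts use different exponent widths.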
The following table shows the differences between the formats:
Table 18.  Differences between Flushed, Extended, and Bfloat Formats

Feature | Flushed | Extended | Bfloat16/Bfloat16+
Input format (sign.exponent.mantissa) | 1.5.10 | 1.5.10 | 1.8.7 or 1.8.10 (Bfloat16+)
FP16 operation format (sign.exponent.mantissa) | 1.5.10 | 1.8.10 | 1.8.10
Input width | 16 bits | 16 bits | 16 or 19 bits (Bfloat16+)
Minimum representable exponent | 5'h01 - 5'h0f = -14 | 8'h01 - 8'h7f = -126 | 8'h01 - 8'h7f = -126
FP16 subnormal | No support for subnormal. Subnormal result is flushed to zero. | Subnormal results can be represented as normal numbers. | No support for subnormal. Subnormal result is flushed to zero.
Exception flags | Overflow, underflow, inexact, and invalid | Infinite, zero, inexact, and invalid | Overflow, underflow, inexact, and invalid
Invalid flag behavior | Asserted when there is an ill-defined operation | Asserted when there is an ill-defined operation or a qNaN input | Asserted when there is an ill-defined operation
Rounding | Round to nearest even (RNE) | RNE or round to zero (RZ), depending on the operands (conditions listed after the table) | RZ

Rounding conditions for the Extended format:
RNE:
  • if both FP16 operands are normal numbers
  • if one of the FP16 operands is a subnormal number and mantissa product is ≥ 1
  • if one of the FP16 operands is a subnormal number and mantissa product = “0.1111111111|1xxxxxxxxx”
  • when using adder/subtractor operations
RZ:
  • if both FP16 operands are subnormal numbers
  • if one of the FP16 operands is a subnormal number and mantissa product is ≤ 1
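The minimum exponents and the operand-dependent RNE/RZ conditions in the table can be sketched in Python as follows. This is an interpretation rather than a description of the hardware: the function names are hypothetical, the conditions are assumed to apply to multiplication of two binary16 inputs in the Extended format, and the mantissa product is assumed to be the product of the two 11-bit significands (10 fraction bits each, so 20 fraction bits in the product). Zero, infinity, and NaN operands are not covered.

```python
def min_normal_exponent(exp_bits: int) -> int:
    """Smallest unbiased exponent of a normal number: field value 1 minus the bias."""
    bias = (1 << (exp_bits - 1)) - 1
    return 1 - bias

def classify_binary16(word: int) -> str:
    """Classify a raw binary16 word by its exponent and fraction fields."""
    exponent = (word >> 10) & 0x1F
    fraction = word & 0x3FF
    if exponent == 0x1F:
        return "special"  # infinity or NaN
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

def significand(word: int) -> int:
    """11-bit significand; the hidden bit is 1 for normals and 0 for subnormals."""
    hidden = 1 if (word >> 10) & 0x1F else 0
    return (hidden << 10) | (word & 0x3FF)

def extended_mul_rounding(a: int, b: int) -> str:
    """Return 'RNE' or 'RZ' for a multiply, under the assumptions stated above."""
    kinds = (classify_binary16(a), classify_binary16(b))
    if any(k not in ("normal", "subnormal") for k in kinds):
        raise ValueError("zero, infinity, and NaN operands are outside this sketch")
    subnormals = kinds.count("subnormal")
    if subnormals == 0:
        return "RNE"  # both operands normal
    if subnormals == 2:
        return "RZ"   # both operands subnormal
    product = significand(a) * significand(b)  # fixed point, 20 fraction bits
    if product >= (1 << 20):
        return "RNE"  # mantissa product >= 1
    if (product >> 9) == 0x7FF:
        return "RNE"  # pattern 0.1111111111|1xxxxxxxxx
    return "RZ"
```

Under these assumptions, `min_normal_exponent(5)` and `min_normal_exponent(8)` give -14 and -126, matching the table. `extended_mul_rounding(0x3C01, 0x03FF)` selects RNE because the product 1025 × 1023 = 2^20 - 1 matches the all-ones pattern, while `extended_mul_rounding(0x3C00, 0x03FF)` selects RZ.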