11.2. DSP Builder Supported Floating-Point Data Types
|Type Name||Sign Width s||Exponent Width e||Exponent Bias b||Mantissa Width m||Description|
|float16_m10||1||5||15||10||Half-precision IEEE 754-2008|
|float19_m10||1||8||127||10||Also known as TF32|
|float32_m23||1||8||127||23||Single-precision IEEE 754|
|float64_m52||1||11||1023||52||Double-precision IEEE 754|
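As a quick sanity check on the table, the sketch below stores the listed field widths and confirms two patterns that the listed formats follow: the number in each type name equals the total bit width, and each bias equals 2^(e-1) - 1. The names `FORMATS`, `total_width`, and `ieee_bias` are illustrative only and are not part of DSP Builder; a 1-bit sign field is assumed for every format.

```python
# Field widths copied from the table above.
# Each entry is (sign width, exponent width, exponent bias, mantissa width).
# Assumption: the sign field is always 1 bit wide.
FORMATS = {
    "float16_m10": (1, 5, 15, 10),
    "float19_m10": (1, 8, 127, 10),
    "float32_m23": (1, 8, 127, 23),
    "float64_m52": (1, 11, 1023, 52),
}


def total_width(name):
    """Total number of bits in the named format."""
    sign_w, exp_w, _bias, man_w = FORMATS[name]
    return sign_w + exp_w + man_w


def ieee_bias(exp_width):
    """Conventional IEEE 754 bias for a given exponent width."""
    return 2 ** (exp_width - 1) - 1


for name, (_s, exp_w, bias, _m) in FORMATS.items():
    print(name, total_width(name), bias == ieee_bias(exp_w))
```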
DSP Builder represents the special values (positive zero, negative zero, subnormals, infinities, and NaNs) in the standard IEEE 754 manner, where s, e, and m denote the sign, exponent, and mantissa fields:
- zero is m=0 and e=0 with s giving the sign.
- subnormal is m != 0 and e=0 with s giving the sign.
- infinity is m=0 and e=all ones with s giving the sign.
- not a number (NaN) is m != 0 and e=all ones.
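The four rules above can be sketched as a small classifier. This is only an illustration of the bit tests, not DSP Builder code; the function name `classify` and its parameters are hypothetical, and the layout is assumed to be sign, then exponent, then mantissa, from the most significant bit down.

```python
def classify(bits, exp_width, man_width):
    """Return (sign, kind) for a raw bit pattern.

    Assumes the layout [sign | exponent | mantissa] with a 1-bit sign.
    """
    man = bits & ((1 << man_width) - 1)
    exp = (bits >> man_width) & ((1 << exp_width) - 1)
    sign = (bits >> (exp_width + man_width)) & 1
    all_ones = (1 << exp_width) - 1

    if exp == 0:
        kind = "zero" if man == 0 else "subnormal"
    elif exp == all_ones:
        kind = "infinity" if man == 0 else "nan"
    else:
        kind = "normal"
    return sign, kind


# float32_m23 examples (exponent width 8, mantissa width 23)
print(classify(0x80000000, 8, 23))  # (1, 'zero')
print(classify(0x7F800000, 8, 23))  # (0, 'infinity')
```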
Except for the preceding special values, the numerical value of a float type is given in terms of its bit-wise representation by:
f = (-1)^s × 2^(e-b) × (1 + (m / (2^m_width)))
where:
- s, e, and m are the base-10 values of the respective bit fields, and b is the exponent bias
- m_width is the width of the mantissa field in bits
- the field widths for each of s, e, and m and the value of b are given for each format in the table
For example, consider a 32-bit single-precision floating-point number (float32_m23) with the bit-wise representation 0x40300000:
- s = 0b = 0
- e = 10000000b = 128
- m = 01100000000000000000000b = 3145728
f = (-1)^0 × 2^(128-127) × (1+(3145728/(2^23))) = 1 × 2 × (1+0.375) = 2.75
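The worked example can be reproduced with a short sketch that extracts the three fields and applies the formula. The helper `decode_normal` is a hypothetical name, not a DSP Builder function; it assumes the layout sign, exponent, mantissa from the most significant bit down, and it rejects the special values because the formula does not cover them. The result is cross-checked against Python's own single-precision decoding via the standard `struct` module.

```python
import struct


def decode_normal(bits, exp_width, bias, man_width):
    """Decode a normal (non-special) value from its raw bit pattern.

    Assumes the layout [sign | exponent | mantissa] with a 1-bit sign.
    """
    m = bits & ((1 << man_width) - 1)
    e = (bits >> man_width) & ((1 << exp_width) - 1)
    s = (bits >> (exp_width + man_width)) & 1

    # Zero, subnormal, infinity, and NaN are excluded from the formula.
    if e == 0 or e == (1 << exp_width) - 1:
        raise ValueError("special value: the formula does not apply")

    return (-1) ** s * 2.0 ** (e - bias) * (1 + m / 2 ** man_width)


value = decode_normal(0x40300000, exp_width=8, bias=127, man_width=23)
print(value)  # 2.75

# Cross-check against the platform's IEEE 754 single-precision decoding.
reference = struct.unpack(">f", (0x40300000).to_bytes(4, "big"))[0]
print(value == reference)  # True
```

The same function handles the other formats in the table by passing their widths and bias, for example `decode_normal(0x3C00, 5, 15, 10)` for float16_m10.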