Conversion Rules for ap_float
You can convert between different sizes of ap_float data types through assignment or by using the convert_to() function. For example:
using namespace ihc;
ap_float<8, 32> myFloat = ...;
ap_float<3, 18> myFloat2 = myFloat; // use rounding rules defined by ap_float type
// use rounding rules defined in convert_to() function call
ap_float<3, 18> myFloat3 = myFloat.convert_to<3, 18, ihc::fp_config::FP_Round::RZERO>();
To convert between native types (for example, float and double) and ap_float data types, assign to or from the types. Type conversion in an assignment occurs according to the rules in Table 1.
For two ap_float variables in a binary operation, the variable with the larger exponent bit width is considered the larger variable. If the two variables have the same exponent bit width, the variable with the larger mantissa bit width is considered the larger variable. Both operands are then unified to the larger type before the binary operation occurs.
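The unification rule above can be sketched at compile time. This is a minimal illustration, not the library's implementation: FloatFormat, is_larger, and Unified are hypothetical names that model only the exponent and mantissa widths of an ap_float<E, M> type.

```cpp
#include <type_traits>

// Hypothetical stand-in for ap_float<E, M>; only the bit widths are modeled.
template <int E, int M>
struct FloatFormat {
    static_assert(E > 0 && M > 0, "bit widths must be positive");
    static constexpr int exponent_bits = E;
    static constexpr int mantissa_bits = M;
};

// True when format A is "larger" than format B: the wider exponent wins,
// and the mantissa width is compared only when the exponent widths match.
template <typename A, typename B>
constexpr bool is_larger() {
    return A::exponent_bits != B::exponent_bits
               ? A::exponent_bits > B::exponent_bits
               : A::mantissa_bits > B::mantissa_bits;
}

// The type both operands would be unified to before a binary operation.
template <typename A, typename B>
using Unified = std::conditional_t<is_larger<A, B>(), A, B>;

// A wider exponent wins even against a wider mantissa.
static_assert(std::is_same<Unified<FloatFormat<8, 10>, FloatFormat<5, 20>>,
                           FloatFormat<8, 10>>::value,
              "exponent width decides first");
// With equal exponent widths, the wider mantissa wins.
static_assert(std::is_same<Unified<FloatFormat<8, 10>, FloatFormat<8, 23>>,
                           FloatFormat<8, 23>>::value,
              "mantissa width breaks the tie");
```

Note that under this rule a type such as <8, 10> is chosen over <5, 20>, so the operand with the wider mantissa is rounded when it is unified.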
Native floating-point data types and ap_float data types are converted to ap_float data types according to the rules in Table 1.
The Intel® oneAPI DPC++/C++ Compiler also provides some operations that leave the precision of input types untouched and provide control over the output precision. For more details, refer to Operations with Explicit Precision Controls.

Table 1. Conversion Rules for ap_float

| Data Type | From ap_float to Data Type | From Data Type to ap_float |
|---|---|---|
| ap_float with higher representable range | Keep the exponent equivalent. The mantissa is rounded according to the rounding mode of the target ap_float (with the higher representable range). | ±Inf if the source of the conversion is out of the representable range. Otherwise, keep the exponent equivalent. The mantissa is rounded according to the rounding mode of the target ap_float (with the smaller representable range). |
| float | Convert the original ap_float to ap_float<8, 23> using the ap_float-to-ap_float rule in the first row, and then bit-cast to float. | Bit-cast float to ap_float<8, 23>, and then convert to the target ap_float precision using the ap_float-to-ap_float rules in the first row. |
| double | Convert the original ap_float to ap_float<11, 52> using the ap_float-to-ap_float rule in the first row, and then bit-cast to double. | Bit-cast double to ap_float<11, 52>, and then convert to the target ap_float precision using the ap_float-to-ap_float rules in the first row. |
| long double (emulation only, Linux only) | Convert the original ap_float to ap_float<15, 63> using the ap_float-to-ap_float rule in the first row, and then insert an explicit 1 bit at the MSB of the fraction bits to get an approximate equivalent of the 80-bit representation of a long double. | Drop the explicit 1 bit of the fraction to convert long double to the 79-bit ap_float<15, 63>. |
| C++ native integer types | Truncate toward zero. Converting an ap_float value that exceeds the range of the integer type is undefined behavior. | Round to nearest, ties to even. If the integer value is too large, the ap_float value saturates to plus infinity. |
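Some of the rules in Table 1 can be checked with native types alone. The sketch below is an illustration under stated assumptions: it assumes IEEE 754 float and double and the default round-to-nearest mode, it uses native float as a stand-in for ap_float<8, 23>, and the helper names storage_bits, from_int, and to_int are hypothetical. It does not use the ap_float library itself.

```cpp
#include <cstdint>
#include <limits>

// Storage bits of a format with an implicit leading 1:
// sign bit + exponent bits + mantissa (fraction) bits.
constexpr int storage_bits(int exponent_bits, int mantissa_bits) {
    return 1 + exponent_bits + mantissa_bits;
}

// The bit-cast rows only make sense if the widths line up.
static_assert(storage_bits(8, 23) == 32, "ap_float<8, 23> is as wide as float");
static_assert(storage_bits(11, 52) == 64, "ap_float<11, 52> is as wide as double");
// ap_float<15, 63> is 79 bits; the explicit leading 1 of the 80-bit
// long double format accounts for the remaining bit.
static_assert(storage_bits(15, 63) + 1 == 80, "79 bits plus one explicit bit");

// numeric_limits counts the implicit leading 1, hence 23 + 1 and 52 + 1.
static_assert(std::numeric_limits<float>::digits == 23 + 1, "assumes IEEE 754 float");
static_assert(std::numeric_limits<double>::digits == 52 + 1, "assumes IEEE 754 double");

// Native float used as a stand-in for ap_float<8, 23>.
constexpr float from_int(std::int32_t v) { return static_cast<float>(v); }
constexpr std::int32_t to_int(float f) { return static_cast<std::int32_t>(f); }

// Integer to floating point: 2^24 + 1 lies halfway between two
// representable values, and the tie goes to the even significand.
static_assert(from_int(16777217) == 16777216.0f, "tie rounds to even (down)");
static_assert(from_int(16777219) == 16777220.0f, "tie rounds to even (up)");

// Floating point to integer: the fractional part is discarded.
static_assert(to_int(2.75f) == 2, "truncates toward zero");
static_assert(to_int(-2.75f) == -2, "truncates toward zero");
```

As with the ap_float rule in the table, converting a native floating-point value that does not fit in the integer type is undefined behavior in C++, so the sketch deliberately avoids that case.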
Avoid assigning the result of the convert_to function to an ap_float variable whose exponent or mantissa width differs from the widths specified in the convert_to call. If the left-hand side of the assignment has a different exponent or mantissa width than the right-hand side, another conversion can occur during the assignment.
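The effect of such a second conversion can be sketched with a small integer model of the significand. This is an assumption-laden illustration, not the ap_float implementation: the helper names are hypothetical, the values are small unsigned integers, and the destination variable is assumed to round to nearest with ties to even.

```cpp
#include <cstdint>

// Number of bits needed to represent v (0 for v == 0).
constexpr int significant_bits(std::uint32_t v) {
    int n = 0;
    for (; v != 0; v >>= 1) ++n;
    return n;
}

// Keep the top `keep` significant bits and discard the rest
// (round toward zero, as requested with RZERO in the example).
constexpr std::uint32_t truncate_to_bits(std::uint32_t v, int keep) {
    const int drop = significant_bits(v) - keep;
    return drop <= 0 ? v : (v >> drop) << drop;
}

// Keep the top `keep` significant bits, rounding to nearest, ties to even.
// Assumption: v is small enough that rounding up cannot overflow 32 bits.
constexpr std::uint32_t round_nearest_even_to_bits(std::uint32_t v, int keep) {
    const int drop = significant_bits(v) - keep;
    if (drop <= 0) return v;  // already representable
    const std::uint32_t unit = std::uint32_t{1} << drop;
    const std::uint32_t half = unit >> 1;
    const std::uint32_t rem = v & (unit - 1);
    std::uint32_t kept = v >> drop;
    if (rem > half || (rem == half && (kept & 1u) != 0)) ++kept;
    return kept << drop;
}

// Source value 0b10111111 (191) has 8 significant bits.
// Step 1: explicit conversion to 6 significant bits, rounding toward zero.
constexpr std::uint32_t converted = truncate_to_bits(191, 6);
// Step 2: assignment to a narrower variable (4 significant bits)
// triggers a second conversion with the destination's own rounding.
constexpr std::uint32_t assigned = round_nearest_even_to_bits(converted, 4);

static_assert(converted == 188, "0b10111100 after the requested conversion");
static_assert(assigned == 192, "0b11000000 after the implicit second conversion");
static_assert(truncate_to_bits(191, 4) == 176, "what RZERO alone would give at 4 bits");
```

In this model the stored value (192) matches neither the requested 6-bit result (188) nor a direct round-toward-zero conversion to 4 bits (176), which is why the widths on both sides of the assignment should match.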