|
IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic (continued) IA-64 FLOATING-POINT DIVIDE AND REMAINDER The floating-point divide algorithms for the IA-64 architecture are based on the Newton-Raphson iterative method and on polynomial evaluation. If a/b needs to be computed and the Newton-Raphson method is used, a number of iterations first calculate an approximation of 1/b, using the function
The iteration step is
![]() where the subscript rn denotes the IEEE rounding to the nearest mode.
Once a sufficiently good approximation y of 1/b is determined, q = a In order to show that the final result generated by the floating-point divide algorithm represents the correctly rounded value of the infinitely precise result a/b in any rounding mode, it was proved (by methods described in [3] and [4]) that the exact value of a/b and the final result q* of the algorithm before rounding belong to the same interval of width 1/2 ulp, adjacent to a floating-point number. Then
where rnd is any IEEE rounding mode. The algorithms proposed for floating-point divide (as well as for square root) are designed, as seen from the Newton-Raphson iteration step shown above, based on the availability of the floating-point multiply-add operation, fma, that performs both the multiply and add operations with only one rounding error. Two variants of floating-point divide algorithms are provided for single precision, double precision, double-extended and full register file format precision, and SIMD single precision. One achieves minimum latency, and one maximizes throughput. The minimum latency variant minimizes the execution time to complete one operation. The maximum throughput variant performs the operation using a minimum number of floating-point instructions. This variant allows the best utilization of the parallel resources of the IA-64, yielding the minimum time per operation when performing the operation on multiple sets of operands.
Double Precision Floating-Point Divide Algorithm All the computational steps are performed in full register file double-extended precision, except for steps (11) and (12), which are performed in full register file double-precision, and step (13), performed in double-precision. The approximate values are shown on the right-hand side.
![]() The first step is a table lookup performed by frcpa, which gives an initial approximation y0 of 1/b, with known relative error determined by m = 8.886. Steps (3) and (4), (6) and (7), and (9) and (10) represent three iterations that generate increasingly better approximations of 1/b in y1 , y2 , and y3. Note that step (2) above is exact: y0 has 11 bits (read from a table), and a has 53 bits in the significand, and thus the result of the multiplication has at most 64 bits that fit in the significand. Steps (5), (8), and (11) calculate three increasingly better approximations q1, q2 and q3 of a/b. Evaluating their relative errors and applying other theoretical properties [4], it was shown that q4 = (a/b)rnd in any IEEE rounding mode rnd, and that the status flag settings and exception behavior are IEEE compliant. Assuming that the latency of all floating-point operations is the same, the algorithm takes seven fma latencies: steps (2) and (3) can be executed in parallel, as can steps (4), (5), (6); then (7), (8), (9) and also (10) and (11). The implementation of this algorithm in assembly language is shown next.
![]() Note that the output predicate p6 of instruction (1) (frcpa) predicates all the subsequent instructions. Also, the output register of frcpa (f8) is the same as the output register of the last operation (in step (13)). If the frcpa instruction encounters an exceptional situation such as unmasked division by 0, and an exception handler provides the result of the divide, p6 is cleared and no other instruction from the sequence is executed. The result is still provided where it is expected. Another observation is that the first and last instructions in the sequence use the user status field (sf0), which will reflect exceptions that might occur, while the intermediate computations use status field 1 (sf1, with wre = 1). This implementation behaves like an atomic double precision divide, as prescribed by the IEEE Standard. It sets correctly all the IEEE status flags (plus the denormal status flag), and it signals correctly all the possible floating-point exceptions if unmasked (invalid operation, denormal operand, divide by zero, overflow, underflow, or inexact result).
Floating-Point Remainder Software Assistance Conditions for Scalar Floating-Point Divide The main issue identified in the process of proving the IEEE correctness of the divide algorithms [4] is that there are cases of input operands for a/b that can cause overflow, underflow, or loss of precision of an intermediate result. Such operands might prevent the sequence from generating correct results, and they require alternate algorithms implemented in software in order to avoid this. These special situations are identified by the following conditions that define the necessity for IA-64 Architecturally Mandated Software Assistance for the scalar floating-point divide operations:
![]() where ea is the (unbiased) exponent of a; eb is the exponent of b; emin is the minimum value of the exponent in the given format; emax is its maximum possible value; and N is the number of bits in the significand. When any of these conditions is met, frcpa issues a Software Assistance (SWA) request in the form of a floating-point exception instead of providing a reciprocal approximation for 1/b, and clears its output predicate. An SWA handler provides the result of the floating-point divide, and the rest of the iterative sequence for calculating a/b is predicated off. The five conditions above can be represented to show how the two-dimensional space containing pairs (ea, eb) is partitioned into regions (Figure 4 of [4]). Alternate software algorithms had to be devised to compute the IEEE correct quotients for pairs of numbers whose exponents fall in regions satisfying any of the five conditions above. Note though that due to the extended internal exponent range (17 bits), the single precision, double precision, and double-extended precision calculations will never require architecturally mandated software assistance. This type of software assistance might be required only for floating-point register file format computations with floating-point numbers having 17-bit exponents. When an architecturally mandated software assistance request occurs for the divide operation, the result is provided by the IA-64 Floating-Point Emulation Library, which has the role of an SWA handler, as described further. The parallel reciprocal approximation instruction, fprcpa, does not signal any SWA requests. When any of the five conditions shown above is met, fprcpa merely clears its output predicate, in which case the result of the parallel divide operation has to be computed by alternate algorithms (typically by unpacking the parallel operands, performing two single precision divide operations, and packing the results into a SIMD result). |