
IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic (continued)
Page 7 of 15
IA-64 FLOATING-POINT SQUARE ROOT
The IA-64 floating-point square root algorithms are also based on Newton-Raphson or similar iterative computations. If needs to be computed and the Newton-Raphson method is used, a number of Newton-Raphson iterations first calculate an approximation of 1/ , using the function
f(y) = 1/y2 - a
The general iteration step is
en = (1/2 - 1/2 a yn2)rn
yn+1 = (yn + en yn)rn
where the subscript rn denotes the IEEE rounding to the nearest mode. The first computation above is rearranged in the real algorithm in order to take advantage of the fma instruction capability.
Once a sufficiently good approximation y of 1/ is determined, S = a y approximates . In certain cases, this too might need further refinement.
In order to show that the final result generated by a floating-point square root algorithm represents the correctly rounded value of the infinitely precise result in any rounding mode, it was proved (by methods described in [3] and [4]), that the exact value of and the final result R* of the algorithm before rounding belong to the same interval of width 1/2 ulp, adjacent to a floating-point number. Then, just as for the divide operation
( )rnd = (R*)rnd
where rnd is any IEEE rounding mode.
Floating-point square root algorithms are provided for single precision, double precision, double-extended and full register file format precision, and for SIMD single precision, in two variants. One achieves minimum latency, and one maximizes throughput.
SIMD Floating-Point Square Root Algorithm
We next present as an example the algorithm that allows computing the SIMD single precision square root, and which is optimized for throughput, having a minimum number of floating-point instructions. The input operand is a pair of single precision numbers (a1, a2). The output is the pair ( 1, 2). All the computational steps are performed in single precision. The algorithm is shown below as a scalar computation. The approximate values shown on the right-hand side are computed assuming no rounding errors occur, and they neglect some high order terms that are very small.

The first step is a table lookup performed by fprsqrta, which gives an initial approximation of (1/ 1,1/ 2) with known relative error determined by m = 8.831. The following steps implement a Newton-Raphson iterative algorithm. Specifically, step (5) improves on the approximation of (1/ 1,1/ 2). Steps (3), (6), (10) and (13) calculate increasingly better approximations of ( 1, 2). The algorithm was proved correct as outlined in [3] and [4]. The final result (R1,R2) equals (( 1)rnd,( 2)rnd) for any IEEE rounding mode rnd, and the status flag settings and exception behavior are IEEE compliant.
The assembly language implementation is as follows (only the floating-point operations are numbered):

Software Assistance Conditions for Scalar Floating-Point Square Root
Just as for divide, cases of special input operands were identified in the process of proving the IEEE correctness of the square root algorithms [4]. The difference with respect to divide is that only loss of precision of an intermediate result can occur in an iterative algorithm calculating the floating-point square root. Such operands might prevent the sequence from generating correct results, and they require alternate algorithms implemented in software in order to avoid this. These special situations are identified by the following condition that defines the necessity for IA-64 Architecturally Mandated Software Assistance for the scalar floating-point square root operation:
ea emin + N - 1 (di might lose precision)
where ea is the (unbiased) exponent of a, emin is the minimum value of the exponent in the given format, and N is the number of bits in the significand. When this condition is met, frsqrta issues a Software Assistance request in the form of a floating-point exception, instead of providing a reciprocal approximation for 1/ , and it clears its output predicate. The result of the floating-point square root operation is provided by an SWA handler, and the rest of the iterative sequence for calculating is predicated off. Due to the extended internal exponent range (17 bits), the single precision, double precision, and double-extended precision calculations will never require architecturally mandated software assistance. This type of software assistance might be required only for floating-point register file format computations with floating-point numbers having 17-bit exponents.
When an architecturally mandated software assistance request occurs for the square root operation, the result is provided by the IA-64 Floating-Point Emulation Library.
Just as for the parallel divide, the parallel reciprocal square root approximation instruction, fprsqrta, does not signal any SWA requests. When the condition shown above is met, fprsqrta merely clears its output predicate, in which case the result of the parallel square root operation has to be computed by alternate algorithms (typically by unpacking the parallel operands, performing two single precision square root operations, and packing the results into a SIMD result).
|