Section 4: Pentium® Processor Divide Algorithm The Pentium processor is Intel's next generation of compatible microprocessors following the Intel486™ CPU family. The primary goal was to combine compiler and hardware technology to maximize performance while preserving software compatibility. More specifically the floating point performance goal was to achieve up to a 5x speedup of floating point vector code and a 3x speedup of scalar code when compared to an Intel486 processor of identical clock frequency[1]. In part, this meant providing a higher performance floating point divider.
The Intel486™ processor used the classic nonrestoring "shift and subtract" division algorithm for its floating point divide operation. This inherently allowed only one quotient bit to be generated per clock. To improve the floating point divide performance on the Pentium processor, a radix 4 SRT algorithm was chosen. This algorithm, as implemented, allows the divide hardware to generate 2 bits of quotient per clock thus approximately doubling the divide performance.
4.1 SRT Algorithm To help better understand the flaw, we have attached a brief introduction to the SRT division technique.
The SRT divide algorithm can basically be described in steps as:
1. Sample the most significant digits of the divisor and dividend.
2. Use the samples as indexes into a lookup table to get a guess of the next quotient digits. (in this case the quotient guess can be 2, 1, 0, +1, +2).
3. Multiply the divisor by the quotient guess and subtract it from dividend (Note that this is an addition if the quotient guess was negative).
4. Save the quotient digits in the least significant digits of a quotient register.
5. Shift the remainder left by 2 and shift the quotient registers left by 2 (i.e. radix 4)
6. Sample the most significant digits of the new shifted partial remainder.
7. Go to step 2 unless you have generated enough significant quotient digits.
8. Generate the binary quotient by assembling the values in the quotient register.
9. If the last partial remainder was negative then adjust the quotient by subtracting the value.
For a complete treatment on the subject of SRT division refer to the paper written by Daniel E. Atkins[2].
Mathematically, the SRT algorithm can be represented with the equations shown in (EQ 1) and (EQ 2) below. The first equation shows the first step of the algorithm where P0, the initial partial remainder, is the dividend. The second equation indicates the recursive nature of the algorithm. Here, to generate the next partial remainder (Pj+1) one must multiply the quotient guess (qj+1) by the divisor (d) and subtract that quantity from the shifted partial remainder of the previous iteration (rPj). Where (r) is the radix of the operation (in this case 4). The number of iterations is determined by the number of significant digits required in the quotient remembering that the first iteration only produces one significant digit.
Note that when the radix is equal to the operative base, this recursive equation yields one digit of quotient per iteration. For higher radix division a multidigit quotient can be generated each iteration. In the case of the Pentium processor, radix 4 division is performed on binary numbers yielding 2 binary quotient digits per iteration. This relationship holds true if the following restriction (for convergence) is placed on the remainder:
The quantity n/(r1) is referred to as the measure of redundancy (MoR) and in the Pentium processor divide implementation evaluates to 2/(41)=2/3. This value will be used from here on.
4.1.1 Quotient Selection Since a nonrestoring division is used and the result is stored in "redundant" form, the guessed quotient need only meet the above remainder criteria. By manipulating the above equations, MIN and MAX equations that bound the next partial remainder for a given divisor and quotient digit are produced.
Plotting these equations generates a plot of the PartialRemainder and Divisor (commonly known as the "PD" plot), that is used in the quotient selection process. The positive half of this plot is shown in Figure 41 . The shaded regions in this plot indicate the region of overlapping quotient choices, where either of the digits indicated may be chosen.
4.1.2 Region of Uncertainty Since the divisor and the shifted partial remainder are truncated there is some uncertainty in the generation of the quotient bits. One must take this uncertainty into account when selecting quotient digits, that is, a quotient digit can only be selected if the region of uncertainty for the given Partial remainder/Divisor pair lies entirely within a quotient region delineated by the equations above. This is where the overlapping nature of these regions comes in handy.
For the uncertainty in the Divisor which is chopped after the 4th bit to the right of the binary point (i.e. maintaining 4 bits to the right of the binary point).
For the Partial Remainder, the estimate comes from the 7 most significant bits of the redundant form (carrysave) "shifted partial remainder" (4 bits to the left and 3 bits to the right of the binary point) thus:
Figure 42 illustrates a simple binary example of the same iterative equation for radix 4 SRT. Note the way that Qpos and Qneg are used to store positive and negative weighted quotients, how the final quotient is derived from the two, and which bits are sampled as indexes into the lookup table (the shaded regions in the partial remainders).
4.2 The Underlying Cause After the quantized PD plot (lookup table) was numerically generated as in Figure 41 , a script was written to download the entries into a hardware PLA (Programmable Lookup Array). An error was made in this script that resulted in a few lookup entries (belonging to the positive plane of the PD plot) being omitted from the PLA. The 5 critical entries are shown in Figure 43 as the shaded regions. As a result of the omission, a divisor/remainder pair that hits these entries during the lookup phase of the SRT algorithm will incorrectly read a quotient digit value of 0 instead of +2. Subsequently, the iterative algorithm will return a quotient result with reduced precision.
As can be seen from the PD plot, the only situations in which there is a probability of seeing this flaw is when the binary divisor has the following bit patterns in the most significant bits: 1.0001, 1.0100, 1.0111, 1.1010, and 1.1101. Empirically, it has been observed that these bit patterns need to be followed by a long string of 1's to further boost the probability of incurring the inaccuracy due to the flaw. Note also, that since the divide hardware operates only on the mantissa the exponents of the operands have no effect on whether this flaw is observed or not.
4.3 Instructions Affected Since it is the divider hardware that exhibits the flaw, instructions that will exhibit the flaw include:
FDIV
FDIVP
FDIVR
FDIVRP
FIDIV
FIDIVR
FPREM
FPREM1
The following transcendental instructions use the divide hardware within their computation but empirical testing of billions of cases have not shown any error:
FPTAN
FPATAN
FYL2X
FYL2XP1
This applies to:
