FDIV Replacement Program
Statistical Analysis of Floating Point Flaw: Intel White Paper - Section 5

Section 5: Evaluation Framework to Gauge Impact on User
The significance of the flaw to an end-user clearly depends upon:

1. The frequency of occurrence of the reduced precision divide within the application. If the flaw is unlikely to be seen during the practical lifetime of the computer, it is of no significance to the user.

2. The (propagated) impact to the end-user when the problem manifests itself.

The frequency of occurrence of the reduced precision divide depends upon how often the application exercises the specific FP instructions on the Pentium CPU, and upon the data fed to those instructions. If and when the problem manifests itself, its impact depends upon how the results of these instructions (along with any inaccuracies) propagate into further computation in the application, and upon how the end-user interprets the application's final results.
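
The propagation effect can be made concrete with a short sketch (not from the white paper). It injects a hypothetical relative error of about 6 x 10^-5, a magnitude widely reported for the flaw's worst cases, into a single divide, and shows that the relative error is preserved by subsequent scaling:

    # Minimal sketch: model a reduced precision divide as an exact
    # divide perturbed by a fixed relative error (an assumption).
    def flawed_divide(a, b, rel_error=6e-5):
        return (a / b) * (1.0 - rel_error)

    a, b = 4195835.0, 3145727.0   # a frequently cited trigger pair
    exact = a / b
    flawed = flawed_divide(a, b)

    # Multiplying both results by the same factor leaves their
    # relative difference unchanged, so the inaccuracy carries
    # straight through later scaling steps.
    scale = 1.0e6
    print(exact * scale, flawed * scale)
    print("relative error:", abs(exact - flawed) / exact)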

The evaluation methodology thus involved first estimating the frequency of occurrence of the reduced precision divide for random input data, and then analyzing each potential occurrence and its environment to gauge its end-impact. The subsequent sections describe the statistical method followed to characterize the frequency of occurrence, propose a metric for comparison, and present reference information to calibrate the significance of a given rate of occurrence.

5.1 Statistical Characterization Methodology for Frequency of Occurrence
This section describes the general statistical methodology used to analyze the frequency of occurrence of the reduced precision divide in applications. It is intractable to enumerate the exact set of input operands that trigger the problem, and the incidence of the problem is best described as a statistical probability. The method for characterizing the frequency of this problem in an end-user application therefore closely parallels the conventional framework used to evaluate the reliability of a computer system subject to an assortment of hard and soft failure modes.

For any given failure mechanism, conventional reliability methods define the FIT rate (Failures In Time) as the number of device failures produced by that mechanism in every 10^9 hours of device operation. The Mean Time Before Failure, or MTBF, is inversely proportional to the FIT rate: an MTBF of 10^9/FIT hours.
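
A purely illustrative calculation (the FIT value here is hypothetical, chosen only to show the arithmetic):

    \mathrm{MTBF} = \frac{10^{9}\ \text{device-hours}}{\mathrm{FIT}},
    \qquad
    \mathrm{FIT} = 1000 \;\Rightarrow\; \mathrm{MTBF} = \frac{10^{9}}{1000} = 10^{6}\ \text{hours} \approx 114\ \text{years}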

When examining the reliability of the overall computer system, one focuses upon the failure mechanisms with the highest FIT rates, since these will make the dominant contribution to a system failure in the field. For example, system failures can occur due to a wide variety of reasons, such as:

1. Human errors in installation

2. Power supply failures

3. Packaging and system interconnect defects

4. CPU failures

5. Memory failures

6. Disk drive failures

7. Keyboard failures, and

8. Failure mechanisms from other devices.

These failure mechanisms span a wide range of FIT rates, and it is typically the mechanism with the highest FIT rate that is most significant from the point of view of frequency of failures.
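
One step of reasoning makes this concrete: assuming the failure mechanisms are independent, their FIT rates add, so the overall rate is dominated by the largest term:

    \mathrm{FIT}_{\text{system}} = \sum_{i} \mathrm{FIT}_{i} \approx \max_{i}\, \mathrm{FIT}_{i}
    \quad \text{when one mechanism dominates}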

5.2 Metric for Evaluating Frequency of Occurrence
A modified form of the conventional FIT rate definition provides a convenient metric for evaluating the frequency of occurrence.

Consider the following analysis:
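
(The sketch below is a reconstruction, assuming the 1-in-9-billion incidence of the reduced precision divide on random operand pairs implied by the text that follows.)

    p = \frac{1}{9 \times 10^{9}}
    \quad \text{(probability of a reduced precision result per independent random divide)}

    \text{mean divides before an inaccuracy} = \frac{1}{p} = 9 \times 10^{9}

    \mathrm{MTBF} = \frac{9 \times 10^{9}}{\text{independent divides per unit time}}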

In effect, the analysis calculates the effective FIT rate due to this failure mechanism in the context of the given application. As the analysis above shows, the mean time before an inaccuracy is encountered is simply the time taken for the user to exercise the application with 9 billion independent divide operations. Alternatively, characterizing applications by how many independent operations of interest (e.g., divide instructions) they run per unit time (say, per day) provides an effective metric for the frequency of occurrence of the reduced precision divide, assuming totally random input data to the instructions.

From the FIT rate for this failure mode alone, the corresponding MTBF is calculated. This MTBF is then compared against the MTBF due to other failure modes, and against the lifetime of the part, to give the user a perspective from which to judge the rarity of the error due to the flaw.
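
A minimal sketch of this calculation, assuming the 1-in-9-billion incidence figure above and purely hypothetical usage rates:

    # Mean time (in days) before one reduced precision divide is
    # encountered, assuming independent random operands.
    def mtbf_days(divides_per_day, p_per_divide=1.0 / 9e9):
        return 1.0 / (divides_per_day * p_per_divide)

    for rate in (10_000, 1_000_000):   # hypothetical divide rates
        days = mtbf_days(rate)
        print(f"{rate} divides/day -> MTBF {days:.3g} days "
              f"(~{days / 365.25:,.0f} years)")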

5.3 Reference Failure Rates
Table 5-1 below summarizes a few failure mechanisms and FIT rates typical in a commercial PC system based on the Pentium processor. Also included in the table is a sample FIT rate for a typical PC user running spreadsheet calculations involving 1,000 independent divides per day on a Pentium processor that exhibits the flaw. As can be seen from this table, the FIT rate due to the flaw is of little significance for such a user, because the mean time before encountering an inaccuracy far exceeds both the time before other failure mechanisms come into play and the practical lifetime of the PC.
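
For the spreadsheet user above, the arithmetic works out as follows (again assuming the 1-in-9-billion incidence figure):

    \mathrm{MTBF} = \frac{9 \times 10^{9}\ \text{divides}}{1000\ \text{divides/day}}
    = 9 \times 10^{6}\ \text{days} \approx 24{,}600\ \text{years}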

Table 5-1 Typical System Failure Rates


This applies to:
FDIV Replacement Program

Solution ID: CS-013012
Last Modified: 17-Oct-2014
Date Created: 08-Jul-2004