Interconnect and Noise Immunity Design for the Pentium® 4 Processor (continued)


Previous Next     Page 9 of 15

FULL-CHIP WIRE NOISE VERIFICATION

The key idea behind the Pentium® 4 processor full-chip noise verification is "strobed signaling." A non-restoring node for noise is a node that, if falsely tripped by noise, will not recover with the passage of time (e.g., a domino node or an off pass-gate latch). A signal is called "strobed" if its logic cone leading to a non-restoring noise node is controlled by a clock (e.g., D1-k domino). In this case, the effect of noise on this node may depend on clock frequency.



Figure 13: Impact of frequency on noise failure

As shown with the D1-k example in Figure 13, at a lower frequency the noise settles before the signal is sampled, so the node does not fail. In most cases, the aggressors switch earlier in the noise case than max delay timing analysis predicts, due to a reduced Miller Coupling Factor (MCF). Further, the worst noise case is usually on fast silicon at high voltage (good for speed). As such, in most cases, we can ignore the cases leading to a slight frequency slowdown in our analysis. The tricky situations are those that lead to excessive frequency slowdown or, even worse, frequency shmoo holes. Before spending valuable CAD tool resources on these non-trivial cases, we needed to convince ourselves that the common benign case is indeed the dominant one and therefore the one on which to base our full-chip wiring methodology.
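The frequency dependence described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the analysis used on the processor: the noise pulse is assumed to decay exponentially after the aggressor switches, the aggressor delay is assumed fixed relative to the clock edge, and the sample point is assumed to be a fixed fraction of the clock period. All names and numbers (`noise_at_sample`, `tau_ps`, the 0.15 V threshold) are hypothetical.

```python
import math

def noise_at_sample(period_ps, agg_switch_ps=300.0, peak_v=0.4,
                    tau_ps=50.0, sample_frac=0.5):
    """Noise left on the victim when the strobe samples it (toy model)."""
    # The sample point scales with the period; the aggressor delay does not.
    sample_ps = sample_frac * period_ps
    # Time the noise has had to settle before sampling. If the sample lands at
    # or before the aggressor edge, assume the full peak (pessimistic simplification).
    settle_ps = max(sample_ps - agg_switch_ps, 0.0)
    return peak_v * math.exp(-settle_ps / tau_ps)

def fails(period_ps, threshold_v=0.15):
    """True if the residual noise at the sample point exceeds the threshold."""
    return noise_at_sample(period_ps) > threshold_v
```

With these assumed numbers, `fails(640)` is true while `fails(1000)` is false: slowing the clock gives the noise more time to settle before the strobe.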

Most full-chip signals are busses (~59,000 out of 72,000 nets), and less than 10% of full-chip signals are "sensitive" (feeding domino receivers or direct pass gate, etc.). Most busses have similar timing among different bits, which should ease the frequency slowdown and shmoo problem. Figure 14 shows the significant effect of this analysis. Most of the effect of this filtering was due to the "required filtering" that characterized frequency slowdown, and very little was due to "valid filtering," which looks for aggressors not switching together.



Figure 14: Impact of frequency independent timing filtering

Frequency Independent Filtering
To solve the rare cases of real noise problems on a strobed signal, we decided to classify noise issues as follows: 1) functional failure at all frequencies; 2) slight slowdown; 3) large slowdown; 4) frequency shmoo hole at a lower frequency as shown in Figure 15; 5) mindelay switching induced noise failure; and 6) excessive coupling causing gate oxide wearout. Issue number 6 was addressed simply through a VCC/2 coupling noise clamp, which was used as a warning. For the rest, we had to implement timing filtering that understood how timing relations change at different frequencies. Timing filtering was first implemented for the Intel® Pentium® Pro processor as the tool Crosswind [4], which introduced the concept of valid and required time window filtering; valid window noise "profiling," or juxtaposition of aggressor noise over the clock period; and rudimentary modeling of drive ratios with fixed thresholds for noise sensitization. Later implementations developed for the Pentium® II and Pentium® III processors improved on several aspects of driver and interconnect modeling.
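The two window filters can be sketched as follows. This is a minimal illustration, not the Crosswind implementation: each aggressor is assumed to be a tuple of (switching window start, window end, injected noise in mV), and the function names and the single `recovery_ps` parameter are hypothetical. Required filtering drops aggressors whose noise ends early enough for the victim to recover before it is sampled; valid filtering sums only the noise of aggressors whose switching windows can overlap.

```python
def required_filter(aggressors, required_ps, recovery_ps):
    """Keep only aggressors whose noise can still be present at the required time."""
    # An aggressor is dropped if its window closes at least recovery_ps
    # before the victim's required (sampling) time.
    return [a for a in aggressors if a[1] + recovery_ps > required_ps]

def worst_overlap_noise(aggressors):
    """Largest summed noise over aggressors whose switching windows overlap."""
    events = []
    for start_ps, end_ps, noise_mv in aggressors:
        events.append((start_ps, 0, noise_mv))   # window opens
        events.append((end_ps, 1, -noise_mv))    # window closes
    # At equal times, opens (0) sort before closes (1), so windows that
    # merely touch are treated as overlapping (conservative).
    events.sort()
    running = worst = 0
    for _, _, delta in events:
        running += delta
        worst = max(worst, running)
    return worst
```

For example, with aggressors `[(0, 100, 100), (50, 150, 80), (400, 500, 120)]`, only the first two windows overlap, so the worst simultaneous noise is 180 mV rather than the 300 mV obtained by assuming all three switch together.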



Figure 15: Frequency shmoo hole

The novel features of timing filtering for the Pentium 4 processor include three modes of frequency analysis (low frequency for burn-in analysis, high frequency for at-frequency noise and delay tests, and all-frequency sweep for noise effects); timing skew between victim and aggressors; required-time filtering with victim recovery; and an interactive graphical waveform interface for timing filter debug.

The design of the Pentium 4 processor brought new challenges to timing filtering because of the complexity of its clocking system. In earlier clocking styles, an excessive slowdown or shmoo hole was usually caused by a very late signal coupling into a signal with early-required time or by the interaction between signals from opposite phases. In the Pentium 4 processor, however, the design incorporates several clocks that are multiples of each other: signals are F(ast), M(edium), and S(low) clocked signals. Not only do signals occur in different phases, but also with different periods. In addition, these differently clocked signals interact as they are not a priori restricted to different regions of the chip. Thus, mid-frequency shmoo holes are much more probable in such a design.

The new approach handles a clocking system with an arbitrary number of phases and an arbitrary number of synchronous clock frequencies by using a Multi-Frequency Algorithm.

At very low frequencies, signals activated by different phases are widely separated in time, so much so that they do not interact. This represents the low end of all frequencies to be considered, while the target operating frequency represents the high end. Sweeping frequencies at a small enough increment to catch waveform overlaps is prohibitive due to the complexity of the internal scan. We, therefore, needed a more adaptive algorithm. Here is the entire algorithm with an all-frequency sweep as its outer loop:

For each victim net:

  1. Collect aggressor set for a given victim and skew timings appropriately.
     
  2. Map clock edge references onto phases of an appropriate clocking system. For example, a set of aggressors with M and F rising edge references requires a two-phase system.
     
  3. Perform a noise sweep, computing aggressor interaction sets and generating timing "filter table."
     
  4. Compute the next highest frequency of interaction among signals.
     
  5. Return to step 2 until there is no more interaction among signals.
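The loop structure of steps 3 to 5 can be sketched as below for a single victim net. This is a skeleton under assumptions, not the production tool: the sweep is assumed to start at the target period and move toward slower clocks, and `noise_sweep` and `next_period` are hypothetical callables standing in for the noise sweep (step 3) and the computation of the next frequency of interaction (step 4).

```python
def all_frequency_sweep(target_period, max_period, noise_sweep, next_period):
    """Visit only the clock periods at which signals interact (one victim net).

    noise_sweep(period) -> filter table for that period (step 3)
    next_period(period) -> next slower period with an interaction, or None (step 4)
    """
    tables = {}
    period = target_period
    # Step 5: repeat until no interaction remains or the low-frequency
    # limit (max_period) is passed.
    while period is not None and period <= max_period:
        tables[period] = noise_sweep(period)
        period = next_period(period)
    return tables
```

The point of this structure is that the number of iterations depends on how many interactions exist, not on how finely the frequency axis would otherwise have to be sampled.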



Figure 16: Illustrating logical switching set groups

The most difficult part of the algorithm is to compute the frequencies of interaction, as illustrated in Figure 16. Given that an O(N log N) scan is in the internal loop, the algorithm cannot afford to sweep with a very fine grain to catch all interactions.

The key to computing the next frequency of interaction is to comprehend the relative velocity of timing edge references as one slows the primary clock. By carefully searching the edges closest to one another and keeping track of their relative velocities, this algorithm can be made reasonably efficient. One difficulty is handling edges that refer to a previous clock phase and that actually move backward with respect to other timing edges as frequency is increased. To handle this and other difficulties, we developed a general approach to handling both the modular nature of signal timings and measuring the frequency at which they may intersect, based on the concept of relative edge velocity.
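One way to picture relative edge velocity is the following sketch. It assumes, as a simplification not stated in the text, that each timing edge sits at `delay + phase * T`, where `T` is the primary clock period, `delay` is a fixed circuit delay, and `phase` is the fraction of the period at which the launching clock edge occurs (negative for an edge referenced to a previous phase). The edge's velocity with respect to `T` is then `phase`, and two edges coincide at `T = (d1 - d2) / (p2 - p1)`. The function name is hypothetical, the pairwise search is quadratic for clarity, and the wrap-around (modular) behavior of edges is ignored.

```python
from itertools import combinations

def next_interaction_period(edges, period):
    """Smallest period greater than `period` at which two edges coincide.

    edges: list of (delay, phase) pairs; edge time is delay + phase * T.
    Returns None if no pair of edges meets at a slower clock.
    """
    best = None
    for (d1, p1), (d2, p2) in combinations(edges, 2):
        rel_velocity = p2 - p1
        if rel_velocity == 0:
            continue  # same velocity: the separation never changes
        t_cross = (d1 - d2) / rel_velocity  # solve d1 + p1*T == d2 + p2*T
        if t_cross > period and (best is None or t_cross < best):
            best = t_cross
    return best
```

For edges `[(300, 0.0), (100, 0.5), (50, 1.0)]` and a current period of 200, the first and third edges meet at a period of 250, where both fall at time 300, so the sweep can jump directly there instead of stepping in small increments.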

Full-Chip Noise Convergence
Detailed noise verification requires a lot of data: circuits, timing information, detailed parasitics, interconnect, etc. For a lead processor like the Pentium 4 processor, "clean" data for all nets are available only very close to tapeout. Further, this detailed model is too slow to turn and, moreover, it is serial in nature. After finding a violation, one has to backtrack through numerous files, models, and schematics to verify if a real problem exists (needle in a haystack scenario). With these incomplete data, trending and schedule predictions are difficult.

To circumvent these problems, simple "perturbation"-based models were built using mathematical spreadsheet software. Parallel probes gathered all relevant information about a net (timing, parasitics, length, circuit, etc.), for a total of 87 relevant metrics per net. Approximately 40 full-chip models were built in one week for various "what if" (perturbation) scenarios. These models looked at tweaking various knobs: number of aggressors, switching probabilities of small aggressors, synchronization of noise propagation with coupling, probability of multiple noise events on the same gate, various clock skew assumptions for timing filtering, various frequencies for allowed frequency slowdown, etc. The goal was to find reasonable settings that expose the really serious problems without producing too many false violations. A detailed NoisePad model was used as the starting point for these models. After this analysis, the new noise was assumed to be a slight perturbation around its NoisePad value and predicted by the change in the knob (e.g., changing lumped %xcap from 100% to 50%).
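The perturbation idea can be sketched as a first-order update around the detailed baseline. This is an illustration under assumptions, not the spreadsheet models themselves: each net is assumed to carry a baseline noise value and a sensitivity to one knob (here, mV of noise per percentage point of lumped cross-capacitance), and the net names, numbers, and function names are all hypothetical.

```python
def perturbed_noise(base_mv, sensitivity_mv_per_pct, knob_delta_pct):
    """First-order estimate: baseline noise plus sensitivity times knob change."""
    return base_mv + sensitivity_mv_per_pct * knob_delta_pct

def violations(nets, knob_delta_pct, threshold_mv):
    """Names of nets whose estimated noise exceeds the threshold."""
    return [name for name, (base_mv, sens) in nets.items()
            if perturbed_noise(base_mv, sens, knob_delta_pct) > threshold_mv]

# Hypothetical nets: name -> (baseline noise in mV, mV per % of lumped xcap)
nets = {
    "busA[0]": (300.0, 2.0),
    "busA[1]": (290.0, 0.5),
    "ctl": (240.0, 1.0),
}
```

Here `violations(nets, 0.0, 250.0)` reports two nets at the baseline, and `violations(nets, -50.0, 250.0)` (lumped %xcap changed from 100% to 50%) reports only `busA[1]`. Because only the change is modeled, errors in the crude sensitivity affect the small delta rather than the full noise value.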

Although these fast models were very crude, they were surprisingly accurate because they did not try to predict the real noise but rather the perturbation (much smaller error). Based on these fast models, another detailed NoisePad model was built with correct knob settings and used for final convergence. As can be clearly seen from Figure 18, this exercise helped us greatly with convergence and saved us an estimated one to two months in our noise convergence schedule. The dramatic decrease in noise violations seen in Figure 17 involved no work from the design team!



Figure 17: Road to noise convergence on the Pentium® 4 processor



