New Reliability, Availability, and Serviceability (RAS) Features in the Intel® Xeon® Processor Family

ID 672700
Updated 6/20/2017
Version Latest
Public

Introduction

The Intel® Xeon® processor Scalable family introduces several new Reliability, Availability, and Serviceability (RAS) features across the product lineup (SKUs designated Bronze, Silver, Gold, and Platinum). The new features improve the platform’s ability to recover from the consumption of bad data, and to detect a faulty instruction execution and retry it in an attempt to recover. The processor also offers a new approach to mapping out failing DRAM devices, which helps prolong the usable life of DIMMs.

This article explores three of these features: Adaptive Double DRAM Device Correction (ADDDC), Advanced Error Detection and Correction (AEDC), and Local Machine Check Exception (LMCE).

Adaptive Double DRAM Device Correction (ADDDC)

The Intel® Xeon® processor Scalable family introduces a new approach to managing the errors that DDR4 DRAM DIMMs may develop over the life of the product. ADDDC is deployed at runtime to dynamically map out a failing DRAM device while continuing to provide Single Device Data Correction (SDDC) ECC coverage on the DIMM, which extends the useful life of the DIMM. The operation works at the fine granularity of a DRAM Bank and/or Rank, so it has minimal impact on overall system performance.

With ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the target threshold, the UEFI runtime code helps place the identified failing DRAM region adaptively into lockstep mode, in which the failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to detect single-DRAM-device (x4) errors and applies a correction algorithm to the affected nibble.

Depending on the processor SKU, each DDR4 channel supports one or two regions that can manage one or two faulty DRAMs at Bank and/or full Rank granularity. Because the operation is dynamic, lockstep affects system performance only after a DRAM device is detected to be failing. The overall performance impact of lockstep is therefore a function of the number of bad DRAM devices on the channel, with the worst case being two bad Ranks on every DDR4 channel.
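The threshold-driven behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware or UEFI implementation: the `Channel` class, the threshold value, and the returned status strings are all hypothetical names chosen for illustration. It shows corrected errors being counted per (rank, bank) region, and a region moving into lockstep only once the count reaches the threshold and a region slot is still free on the channel.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Toy model of one DDR4 channel with a budget of adaptive regions."""
    max_regions: int = 2   # one or two per channel, depending on SKU
    threshold: int = 3     # hypothetical corrected-error threshold
    counts: Counter = field(default_factory=Counter)
    lockstep: set = field(default_factory=set)

    def corrected_error(self, rank: int, bank: int) -> str:
        """Record one corrected error and report the resulting action."""
        region = (rank, bank)
        if region in self.lockstep:
            return "already-lockstep"
        self.counts[region] += 1
        if self.counts[region] < self.threshold:
            return "counted"            # still in performance mode
        if len(self.lockstep) >= self.max_regions:
            return "no-region-free"     # region budget exhausted
        self.lockstep.add(region)       # only this region pays the cost
        return "entered-lockstep"


if __name__ == "__main__":
    ch = Channel(max_regions=1, threshold=2)
    for _ in range(3):
        print(ch.corrected_error(rank=0, bank=3))
```

In this model the rest of the channel stays in performance mode, which mirrors the point that the lockstep cost appears only after a device is found to be failing.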

The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out.

Advanced Error Detection and Correction (AEDC)

AEDC improves fault coverage within the core execution engine by using proprietary residue-code fault detection to identify and correct errors that the processor may encounter in the engine’s internal pipelines (arrays and logic). AEDC attempts to correct a fault by retrying the instruction. A retry that succeeds is treated as a corrected event; otherwise, a fatal MCERR is logged and signaled.
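The residue code used by AEDC is proprietary, so the following is only a generic illustration of the detect-and-retry idea, using a textbook mod-3 residue check on addition. The function names, the choice of modulus, and the fault-injecting adder are assumptions made for this sketch. The check relies on the identity (a + b) mod 3 = ((a mod 3) + (b mod 3)) mod 3; a single flipped bit changes the result by a power of two, which is never a multiple of 3, so the mismatch is always caught.

```python
MODULUS = 3  # illustrative check modulus; the real residue code is proprietary


def residue_ok(a: int, b: int, result: int) -> bool:
    """True if result is consistent with a + b under the mod-3 residue check."""
    return result % MODULUS == (a % MODULUS + b % MODULUS) % MODULUS


def add_with_retry(a, b, adder, max_retries=1):
    """Run adder(a, b); on a residue mismatch, retry up to max_retries times.

    Returns (result, status), where status is "ok" on a clean first attempt
    or "corrected" when a retry succeeded. Raises if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        result = adder(a, b)
        if residue_ok(a, b, result):
            return result, ("ok" if attempt == 0 else "corrected")
    raise RuntimeError("uncorrected fault: would be signaled as a fatal MCERR")


def make_flaky_adder(faults: int):
    """Hypothetical adder that flips bit 0 of its first `faults` results."""
    remaining = [faults]

    def adder(a, b):
        if remaining[0] > 0:
            remaining[0] -= 1
            return (a + b) ^ 1   # inject a transient single-bit fault
        return a + b

    return adder


if __name__ == "__main__":
    print(add_with_retry(5, 7, make_flaky_adder(0)))
    print(add_with_retry(5, 7, make_flaky_adder(1)))
```

A transient fault that disappears on retry ends up as a corrected event, while a fault that persists through every retry is escalated, matching the two outcomes described above.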

AEDC technology in the processor is self-contained. It uses the existing error signaling and logs to flag errors, and needs no special assistance from the operating system to become operational. AEDC is offered across all product SKUs.

Local Machine Check Exception (LMCE)

LMCE is a new RAS operation that localizes the handling of bad data consumption to the core that consumed the bad data. Localizing error handling in this manner prevents multiple machine check conditions from occurring and improves the performance of MCA Recovery — Execution Path.

Because error signaling is localized, each remote core that encounters bad data can also invoke its own LMCE and attempt recovery without interfering with the operation of other cores. LMCE can thereby help recovery succeed in a number of corner cases.

MCA Recovery — Execution Path

The MCA Recovery — Execution Path feature offers the capability for a system to continue to operate even when the processor is unable to correct data errors within the memory sub-system and allows software layers (operating system, VMM, DBMS, and applications) to participate in system recovery.

Recovery is possible for Software Recoverable Action Required (SRAR) error types. The machine check architecture protocol requires the machine check error (MCERR) to be broadcast to all threads, which then establish a rendezvous point. When cores consume bad data in close proximity to one another, each thread signals its own MCERR, and the resulting multiple MCERR conditions lead to an undesirable system shutdown.

LMCE can help overcome such conditions by localizing MCERR signaling to the consuming thread only, allowing each thread to recover from the bad data it consumed. This change in the protocol requires the operating system to be aware that the platform is LMCE-ready and then to opt in to support LMCE flows.
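The difference between broadcast and local signaling can be sketched from the point of view of a machine check handler. This is a simplified model, not operating system code: the function name and the thread-list representation are hypothetical, and the bit position of the LMCE_S flag in IA32_MCG_STATUS (bit 3) reflects my reading of the Software Developer's Manual and should be verified there.

```python
# IA32_MCG_STATUS.LMCE_S: set when the machine check was delivered locally.
# Bit position taken from my reading of the SDM; verify before relying on it.
MCG_STATUS_LMCE_S = 1 << 3


def threads_in_recovery(mcg_status: int, this_thread: int, all_threads: list) -> list:
    """Return the threads that take part in handling this machine check.

    With a local machine check only the consuming thread is involved.
    Otherwise the MCERR is broadcast and every thread must rendezvous.
    """
    if mcg_status & MCG_STATUS_LMCE_S:
        return [this_thread]
    return list(all_threads)


if __name__ == "__main__":
    threads = [0, 1, 2, 3]
    print(threads_in_recovery(MCG_STATUS_LMCE_S, 2, threads))  # local
    print(threads_in_recovery(0, 2, threads))                  # broadcast
```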

How LMCE is enabled

LMCE requires support from the processor, the UEFI code, and the operating system. The operation is disabled by default and can be enabled only if support is present in every one of these layers. The following steps must be completed before LMCE can be used:

  1. The hardware indicates to UEFI code that LMCE support is available in the SKU.
  2. In a firmware-first model, the UEFI code must comprehend LMCE flow and signal platform readiness to support the flow to the operating system.
  3. The operating system needs to comprehend LMCE flows and check the platform readiness to support LMCE. If the operating system is not aware of this feature then LMCE remains OFF.
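The three steps above can be sketched as a layered check over model-specific register (MSR) values. This is a minimal sketch under stated assumptions: the function names and status strings are hypothetical, the MSR values are passed in as plain integers (reading real MSRs needs privileged access that is not shown), and the bit positions reflect my reading of the Software Developer's Manual and should be confirmed there.

```python
# Bit positions as I understand them from the SDM; verify before use.
MCG_CAP_LMCE_P = 1 << 27            # IA32_MCG_CAP: processor supports LMCE
FEATURE_CONTROL_LOCK = 1 << 0       # IA32_FEATURE_CONTROL: register locked
FEATURE_CONTROL_LMCE_ON = 1 << 20   # IA32_FEATURE_CONTROL: firmware opt-in
MCG_EXT_CTL_LMCE_EN = 1 << 0        # IA32_MCG_EXT_CTL: operating system opt-in


def lmce_state(mcg_cap: int, feature_control: int, os_supports_lmce: bool) -> str:
    """Walk the processor, firmware, and OS layers in order."""
    if not mcg_cap & MCG_CAP_LMCE_P:
        return "unsupported-by-processor"
    firmware_ready = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_LMCE_ON
    if feature_control & firmware_ready != firmware_ready:
        return "not-enabled-by-firmware"
    if not os_supports_lmce:
        return "off-os-unaware"      # LMCE stays OFF, as in step 3
    return "can-enable"


def mcg_ext_ctl_after_opt_in(current: int) -> int:
    """Value the OS would write to IA32_MCG_EXT_CTL to opt in."""
    return current | MCG_EXT_CTL_LMCE_EN


if __name__ == "__main__":
    cap = MCG_CAP_LMCE_P
    fc = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_LMCE_ON
    print(lmce_state(cap, fc, os_supports_lmce=True))
    print(lmce_state(cap, fc, os_supports_lmce=False))
```

Ordering the checks from hardware upward mirrors the steps listed above: a missing capability at any layer leaves LMCE disabled regardless of what the layers above it support.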

More information about LMCE can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manuals.

Conclusion

Intel Xeon processors continue to enhance system RAS feature offerings across all segments of the computing industry. Intel® Xeon® platforms using any of the processor SKUs (Bronze, Silver, Gold, or Platinum) can benefit from the enhancements. The new capabilities translate to higher system reliability and availability, achieved through innovative error detection and retry mechanisms, improvements to recovery methodology, and a performance-optimized memory subsystem capable of prolonging the useful life of installed DDR4 DIMMs.
