Basic Diagnostics for Correctable/Uncorrectable ECC Memory Errors with Intel® Server Boards

Documentation

Troubleshooting

000024007

04/18/2022

What am I seeing?

Correctable and/or Uncorrectable Error Correcting Code (ECC) events for memory modules. For example:

Mmry ECC Sensor SMI Handler Warning Memory CPU: 1, DIMM: D0 DIMM Rank: 1. - Correctable ECC / other correctable memory error - Asserted.
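
To track such events per DIMM, the fields in the message can be extracted programmatically. The following is a minimal sketch in Python, written only against the single example message above. The pattern and the function name parse_ecc_event are illustrative assumptions, not an Intel-defined format. Event wording may differ between platforms and firmware versions, and the sketch only assumes that uncorrectable events follow the same layout.

```python
import re

# Assumption: pattern derived from the one example message shown above.
# Other platforms or firmware versions may word the event differently.
SEL_ECC_PATTERN = re.compile(
    r"CPU:\s*(?P<cpu>\d+),\s*DIMM:\s*(?P<dimm>\w+)\s+DIMM Rank:\s*(?P<rank>\d+)\.?"
    r"\s*-\s*(?P<kind>Correctable|Uncorrectable) ECC"
)


def parse_ecc_event(message):
    """Return CPU, DIMM, rank and error kind from one event message, or None."""
    match = SEL_ECC_PATTERN.search(message)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "dimm": match.group("dimm"),
        "rank": int(match.group("rank")),
        "kind": match.group("kind").lower(),
    }


example = (
    "Mmry ECC Sensor SMI Handler Warning Memory CPU: 1, DIMM: D0 DIMM Rank: 1. "
    "- Correctable ECC / other correctable memory error - Asserted."
)
print(parse_ecc_event(example))
```

Messages that do not match the pattern return None, so unrelated SEL entries can be skipped rather than misread.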

What is a Memory Error Correction Code (ECC) Correctable Error Event?

An ECC correctable error event represents a threshold overflow for a given Dual In-line Memory Module (DIMM) within a given timeframe.


How to fix it

Memory data errors are logged as correctable or uncorrectable. Refer to the instructions below, based on the error type you encounter:

(Figure: error types)

Notes 
  • If there is no catastrophic issue (Purple Screen of Death (PSOD) or unexpected restart), and correctable ECC errors, including Adaptive Double Device Data Correction (ADDDC) errors, occur fewer than 10 times every 24 hours for each DIMM location, the error count is within the threshold limit. The recommendation is to monitor each DIMM location that triggers the event for any recurrence of the ECC error.
     
  • If there is no catastrophic issue (Purple Screen of Death (PSOD) or unexpected restart), and correctable ECC errors, including Adaptive Double Device Data Correction (ADDDC) errors, occur more than 10 times every 24 hours for a DIMM location, it is recommended to re-seat each affected DIMM by following the steps below:
    1. Power OFF the system and remove the AC power cable;
    2. Identify the DIMM location to re-seat; refer to the Technical Product Specification for your server platform to identify the DIMM location;
    3. Re-seat the identified DIMM(s);
    4. Insert the AC power cable and power ON the system;
    5. Observe for 24 hours for any recurrence of the ECC error;
    6. If the ECC error persists at the same DIMM location that was re-seated, generate the SEL and Debug logs from the BMC Web Console and send both to Intel Customer Support.
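
The threshold check in the notes above can be sketched as a short Python script. This is an illustration under stated assumptions, not an Intel tool. The names peak_events_per_window and recommend are hypothetical, events are assumed to be already available as (timestamp, DIMM location) pairs, and the 24-hour period is treated as a sliding window. A count of exactly 10 is treated here as reaching the threshold, which is an assumption, since the notes only state that fewer than 10 events is within the limit.

```python
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = 10                 # events per DIMM location, from the notes above
WINDOW = timedelta(hours=24)   # observation period, from the notes above


def peak_events_per_window(timestamps, window=WINDOW):
    """Largest number of events that fall inside any single window."""
    ordered = sorted(timestamps)
    peak = 0
    start = 0
    for end, current in enumerate(ordered):
        # Advance the left edge until the span is shorter than the window.
        while current - ordered[start] >= window:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


def recommend(events, threshold=THRESHOLD):
    """Map each DIMM location to 'monitor' or 're-seat'.

    events: iterable of (timestamp, dimm_location) pairs for correctable
    ECC (including ADDDC) events.
    Assumption: a peak count equal to the threshold counts as exceeding it.
    """
    by_dimm = defaultdict(list)
    for timestamp, location in events:
        by_dimm[location].append(timestamp)
    return {
        location: "monitor" if peak_events_per_window(stamps) < threshold else "re-seat"
        for location, stamps in by_dimm.items()
    }


if __name__ == "__main__":
    base = datetime(2022, 4, 18)
    # Hypothetical data: 12 hourly events on one DIMM, 3 on another.
    sample = [(base + timedelta(hours=h), "CPU1_D0") for h in range(12)]
    sample += [(base + timedelta(hours=h), "CPU1_A1") for h in range(3)]
    print(recommend(sample))
```

The result only indicates which branch of the notes applies. A catastrophic issue such as a PSOD or an unexpected restart falls outside this check.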

 

Notes

Correctable Error Correction Code (ECC) errors are self-correcting. Depending on the Reliability, Availability, and Serviceability (RAS) configuration of the memory, the Integrated Memory Controller (IMC) may take the affected DIMM offline.

Event definitions differ between Intel server platforms. Refer to the System Event Log Troubleshooting Guide for your server platform.

Intel recommends downloading the latest available system BIOS version for your server platform and updating to it.

If the system is an Intel® Data Center Block for Nutanix* Enterprise Cloud, visit the Nutanix* Life Cycle Manager page instead. For a list of hardware and firmware compatibility, visit the Nutanix* Hardware and Firmware compatibility page.

 

Related topics
The Role of ECC Memory
How to Recover from an IERR for Intel® Server Boards
My Server Crashes and Shows this Error: Processor CPU Machine Chk
For firmware updates and troubleshooting tips
What is the Threshold Tolerance for Correctable Memory Errors for Intel® Server Boards?