Case Study | Memory Failure Prediction | Tencent Cloud Solutions
Intel® Memory Failure Prediction at Tencent

Intel® Memory Failure Prediction substantially improves memory reliability through online machine learning and reduces downtime

Business: Tencent is one of the biggest cloud solution providers in China, with a presence on three continents.

Challenges
• Real-time visibility into memory health
• Effective DIMM replacement strategy
• Predictive insights into server memory uptime and workload transfer

Solution
• Intel® Memory Failure Prediction

[Image: Tencent Seafront Towers in Shenzhen, China]

Executive Summary

Tencent, a leading China-based global cloud-solutions provider with operations in APAC, Europe and North America, set up Intel® Memory Failure Prediction (Intel® MFP) in a test deployment covering thousands of servers based on Intel® Xeon® Scalable Processors, with the goal of reducing downtime caused by server memory failures. Tencent’s IT staff deployed Intel® MFP in their data center and integrated it into their existing management systems to analyze server memory failures, predict potential future failures, reduce downtime, and improve their current Dual Inline Memory Module (DIMM) replacement and upgrade policies.

The deployment improved memory reliability because its predictions draw on micro-level memory failure information captured from the operating system’s Error Detection and Correction (EDAC) driver, which stores historical memory error logs. Intel® MFP also gave Tencent’s IT staff enough information to address potential memory issues proactively and to replace failing DIMMs before they reach a terminal stage and cause server failures, thus reducing downtime.

This initial test deployment indicated a 5X improvement in DIMM-level failure prediction. If Tencent deployed Intel® MFP across all of its data centers, it would improve the effectiveness of reliability-aware workload management, decrease the percentage of Uncorrectable Errors (UEs), and therefore significantly reduce downtime.
Additionally, Tencent’s operational efficiency would improve, and its spending on unnecessary DIMM purchases would fall.

Intel® Memory Failure Prediction at Tencent
• Reduces uncorrectable memory errors
• Simplifies workload migration decision making
• Improves DIMM failure prediction 5X
• Optimizes page offlining policies
• Improves DIMM toss & purchase decisions
• Reduces downtime caused by server memory failures

Background

Memory failures are among the most critical hardware failures that occur in data centers today. Intel® MFP is well suited to organizations such as online and cloud service providers that depend heavily on server reliability, availability and serviceability (RAS). Intel® MFP analyzes historical data to predict memory failure events, so that potentially catastrophic events can be prevented before they happen. Intel® MFP is vendor agnostic and works in conjunction with other data center management solutions, including Intel® Data Center Manager (Intel® DCM).

Tencent deployed Intel® MFP in a test environment containing thousands of servers with Intel® Xeon® Scalable Processors to gain better insight into their memory health. Intel® MFP monitored the health of the servers’ Dynamic Random Access Memory (DRAM) modules and provided administrators with critical information about them, including a health score based on their historical data.

Intel® MFP Provides Real-time Memory Health Insights

Intel® MFP uses online machine learning to analyze the historical data collected on server memory down to the DIMM, bank, column, row, and cell levels, and assigns each module a memory health score to predict potential future failures.
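To make the idea of a health score derived from historical error data more concrete, the following is a minimal illustrative sketch in Python. It is not Intel® MFP's algorithm, which this case study does not describe. The record schema, the class and method names, the decay half-life, and the scoring formula are all hypothetical assumptions chosen only to show how a per-DIMM score could be updated online, one error record at a time.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """One correctable-error record (hypothetical schema, not Intel MFP's)."""
    dimm: str
    bank: int
    row: int
    column: int
    timestamp: float  # seconds since an arbitrary epoch


class DimmHealthTracker:
    """Toy online scorer: per-DIMM state is updated as each error arrives."""

    def __init__(self, half_life_s: float = 86_400.0, scale: float = 10.0):
        # Assumption: older errors matter less, halving in weight every half_life_s.
        self.decay = math.log(2) / half_life_s
        self.scale = scale                   # assumed sensitivity of the score
        self.rate = defaultdict(float)       # time-decayed error count per DIMM
        self.last_seen = {}                  # timestamp of last error per DIMM
        self.rows = defaultdict(set)         # distinct (bank, row) pairs per DIMM

    def observe(self, err: CorrectableError) -> None:
        """Fold one error record into the running state for its DIMM."""
        last = self.last_seen.get(err.dimm, err.timestamp)
        dt = max(0.0, err.timestamp - last)
        decayed = self.rate[err.dimm] * math.exp(-self.decay * dt)
        self.rate[err.dimm] = decayed + 1.0
        self.last_seen[err.dimm] = err.timestamp
        self.rows[err.dimm].add((err.bank, err.row))

    def health_score(self, dimm: str) -> float:
        """Return a score in (0, 100]; 100 means no errors have been observed."""
        # Assumption: both recent error volume and spatial spread lower the score.
        load = self.rate.get(dimm, 0.0) + len(self.rows.get(dimm, ()))
        return 100.0 * math.exp(-load / self.scale)


# Usage: feed a few synthetic errors for one DIMM and compare with a clean one.
tracker = DimmHealthTracker()
for i in range(5):
    tracker.observe(
        CorrectableError("DIMM_A1", bank=0, row=i, column=3, timestamp=i * 60.0)
    )
print(tracker.health_score("DIMM_A1"))  # below 100: errors were observed
print(tracker.health_score("DIMM_B1"))  # 100.0: no errors observed
```

In this sketch the score falls as errors become more frequent or spread across more rows, and recovers slowly as old errors decay. A production system would learn such relationships from fleet data rather than rely on a fixed formula.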
The resulting analysis and health scores indicated the potential for a large number of memory issues within Tencent’s test environment including both Correctable Errors (CE) and Uncorrectable Errors (UE). A