Intel® Memory Failure Prediction Improves Reliability at Meituan

Intel® Memory Failure Prediction
Meituan eCommerce Platform for Services

Intel® Memory Failure Prediction uses machine learning to issue alerts about potential memory failures before the hardware fails, thus reducing the impact of downtime.

Executive Summary

Meituan-Dianping (Meituan), a leading Chinese e-commerce platform for services, set up Intel® Memory Failure Prediction (Intel® MFP) in a test deployment across several thousand servers based on Intel® Xeon® Scalable Processors. The goal was to help improve the performance and reliability of its server memory, which is essential to fast data analytics computing.

Meituan deployed Intel® MFP in its data center, integrating it into its existing management solutions to take advantage of its memory analysis and predictive capabilities. The aim was to analyze and model server memory-failure data in order to predict potential failures, prevent downtime, and optimize its current Dual Inline Memory Module (DIMM) upgrade.

The Intel® MFP deployment improved memory reliability through predictions based on analysis of micro-level memory failure logs. Intel® MFP allowed data center staff to migrate workloads before catastrophic memory failures could happen, use page offlining policies to isolate unreliable memory cells or pages, or replace failing DIMMs before they reached a terminal stage. Responding before a server fails in this way reduces downtime.

“We would thank Intel for Memory Failure Prediction collaboration with Meituan,” said Rui Guo, leader of Infrastructure/Server technology at Meituan, “the testing results indicates, with Intel® MFP’s prediction capabilities, it could significantly reduce server hardware failures by up to 40 percent.”

Background

Meituan-Dianping, a leading Chinese e-commerce platform for services, offers Meituan, Dianping, Takeaway, Taxi, Mobike, and other well-known apps to customers. Its services span more than 200 categories, including catering, takeaway, taxi, bike, hotels, tourism, film, and entertainment, and the business covers 2,800 cities. To remain successful and competitive, Meituan has to be able to rely on the health of its data center infrastructure and to predict failures so that it can act proactively.

Memory failures are one of the top three hardware failures that occur in data centers today. Using machine learning to analyze real-time memory health data makes it possible to predict such failures ahead of time, which ultimately translates to a better experience for Meituan's customers.

This is why Meituan deployed Intel® MFP in a test environment containing several thousand servers with Intel® Xeon® Scalable Processors. Meituan integrated Intel® MFP into its existing data center monitoring solution and gained greater insight into server memory health.

Intel® MFP is an ideal solution for organizations such as online services platforms and cloud service providers relying heavily on server hardware reliability, availability and serviceability. The solution helps to significantly reduce memory failure events by analyzing data and then predicting catastrophic events before they happen.

Intel® MFP Provides Real-time Memory Health Visibility
Intel® MFP uses machine learning to analyze server memory errors down to the DIMM, bank, column, row, and cell levels to generate a memory health score which can be used to predict potential failures.

Meituan monitored the health of the memory modules of their servers by integrating Intel® MFP into their existing data center management solution. By analyzing data that was previously collected by their data center management software, they were able to generate prediction scores for each DRAM module, and then take appropriate action to maintain their SLAs and maximize their service uptimes.
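
As a rough illustration of per-DIMM scoring, the sketch below groups correctable-error records by DIMM and derives a toy health score from the error count and the number of distinct rows affected. The record fields, the weights, and the formula are all assumptions made for illustration. They are not Intel® MFP's model, which the source describes only as machine learning based.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    # Hypothetical record layout; real error logs vary by platform.
    dimm: str
    bank: int
    row: int
    column: int


def toy_health_scores(errors):
    """Return {dimm: score in (0, 1)} for every DIMM with at least one error.

    Illustrative heuristic only: more errors, and errors spread over more
    distinct (bank, row) pairs, give a lower score.
    """
    counts = defaultdict(int)
    rows = defaultdict(set)
    for e in errors:
        counts[e.dimm] += 1
        rows[e.dimm].add((e.bank, e.row))
    return {
        dimm: 1.0 / (1.0 + counts[dimm] + 2.0 * len(rows[dimm]))
        for dimm in counts
    }
```

A DIMM that never appears in the error log is simply absent from the result, so a caller would treat a missing entry as healthy.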

Intel® MFP Enables Memory Reliability-aware Workload Migration
Intel® MFP generated memory health scores that helped Meituan make memory reliability-aware decisions in workload scheduling, such as migrating critical tasks from distressed servers to other servers. This gave staff ample time to act and avoid critical application crashes.
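
A minimal sketch of how a scheduler might act on such scores follows. The function name, the 0.3 threshold, and the "higher is healthier" score convention are assumptions for illustration, not part of Intel® MFP or Meituan's scheduler, and the sketch ignores capacity on the target server.

```python
def plan_migrations(server_scores, critical_tasks, threshold=0.3):
    """Map each critical task on a distressed server to the healthiest server.

    server_scores: {server: health score, higher is healthier} (assumed scale)
    critical_tasks: {task: server the task currently runs on}
    Servers without a score are treated as healthy.
    """
    healthy = [s for s, score in server_scores.items() if score >= threshold]
    if not healthy:
        return {}  # nowhere safe to move work; leave placement unchanged
    target = max(healthy, key=lambda s: server_scores[s])
    return {
        task: target
        for task, server in critical_tasks.items()
        if server_scores.get(server, 1.0) < threshold
    }
```

For example, with scores `{"s1": 0.1, "s2": 0.9, "s3": 0.5}` and critical tasks `{"db": "s1", "web": "s3"}`, only `db` is moved, and it goes to `s2`.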

Intel® MFP Reduces Unnecessary DIMM Replacements
By analyzing memory errors and predicting potential memory failures before they happen, Intel® MFP can help improve the DIMM replacement strategy.

Intel® MFP Optimizes OS Page Offlining
When there is a burst of errors in a specific memory region, that region is likely to fail soon. By detecting this early, Intel® MFP can suggest disabling faulty memory pages, preventing them from being used again and thus reducing the risk of uncorrectable errors. This is called page offlining, and it has become critical for large-scale data centers today.
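
The burst detection described above can be sketched as follows. The one-hour window, the threshold of three errors, and the 4 KiB page size are assumptions for illustration and do not reflect how Intel® MFP decides. The `soft_offline` helper writes to the Linux kernel's soft-offline sysfs file, which requires root and a kernel built with memory-failure support.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed 4 KiB pages
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"


def pages_with_error_burst(events, window_s=3600.0, threshold=3):
    """Return sorted page base addresses with >= threshold errors in window_s.

    events: iterable of (timestamp_seconds, physical_address) pairs.
    """
    by_page = defaultdict(list)
    for ts, addr in events:
        by_page[addr - addr % PAGE_SIZE].append(ts)
    bursty = []
    for page, stamps in by_page.items():
        stamps.sort()
        # Check every run of `threshold` consecutive errors on this page.
        for i in range(len(stamps) - threshold + 1):
            if stamps[i + threshold - 1] - stamps[i] <= window_s:
                bursty.append(page)
                break
    return sorted(bursty)


def soft_offline(page_addr, path=SOFT_OFFLINE):
    """Ask the kernel to move the page's contents and stop using the page."""
    with open(path, "w") as f:
        f.write(hex(page_addr))
```

Keeping detection separate from the sysfs write lets an operator review the candidate pages before any page is actually taken offline.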

Intel® Memory Failure Prediction Deployment Results

By integrating Intel® MFP into its management solution, Meituan was able to analyze the memory health of the servers in its test environment.

This helped Meituan predict failures before they happened and make informed decisions, such as using page offlining and migrating workloads and tasks to other servers.

The initial Intel® MFP test deployment indicated that if Meituan deployed the solution across its full server network, server crashes caused by hardware failures could be reduced by up to 40 percent.

The deployment therefore showed that, if Intel® MFP were deployed across all of its data centers, Meituan could significantly reduce server downtime due to memory failures, delivering a better experience for hundreds of millions of customers and local vendors.

Where to Get More Information

For more information on Intel® Memory Failure Prediction, visit intel.com/dcm or contact dcmsales@intel.com