Persistent Memory Programming—Handling Memory Errors

Hi, I'm Andy Rudoff from Intel. In this video, I'll provide an overview of how to handle memory errors when developing programs for persistent memory, or PMEM. 

The error handling strategy for an application is an important part of the program's overall reliability. For server applications, this strategy directly affects availability, that is, the percentage of the time the application is expected to be available to do its job. PMEM programming brings some interesting new considerations to error handling. So let's dig into the details. 

First, some background on memory errors. The main memory on a server, often referred to as DRAM, is protected using error correction codes, or ECC. This is a hardware feature that can automatically correct many memory errors that happen due to transient hardware issues, such as power spikes, media errors, and so on. 

But if an error is serious enough, it will corrupt so many bits that ECC can't correct it. The result is known as an uncorrectable error. Most applications never worry about uncorrectable errors. Server hardware has become reliable enough that they are a rare occurrence. 

Using the Linux operating system environment as an example, if a program experiences an uncorrectable error in DRAM, the application is sent a SIGBUS signal, which kills it. While it's possible for an application to catch a SIGBUS instead of exiting, it's very rarely done, since the recovery logic would be very complicated. 
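To make the "possible but rarely done" point concrete, here is a minimal sketch in C of catching SIGBUS around a single memory access, assuming a POSIX system. The names install_sigbus_handler, guarded_access, simulated_fault, and clean_access are hypothetical and are not part of PMDK, and the fault is simulated with raise() rather than produced by a real uncorrectable error.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Jump target used to abandon the faulting access (hypothetical name).
 * A single global buffer means this sketch is not thread-safe. */
static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
    (void)sig;
    /* Jump back to the sigsetjmp() call in guarded_access(). Because the
     * mask was saved with sigsetjmp(env, 1), SIGBUS is unblocked again. */
    siglongjmp(sigbus_env, 1);
}

/* Install the SIGBUS handler. Returns 0 on success, -1 on failure. */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Run access(arg) under SIGBUS protection.
 * Returns 0 if it completed, -1 if a SIGBUS interrupted it. */
int guarded_access(void (*access)(void *), void *arg)
{
    if (sigsetjmp(sigbus_env, 1) != 0)
        return -1;              /* arrived here via siglongjmp() */
    access(arg);
    return 0;
}

/* Stand-in for touching a bad memory location: raises SIGBUS directly.
 * This is an assumption made so the sketch can run anywhere. */
void simulated_fault(void *arg)
{
    (void)arg;
    raise(SIGBUS);
}

/* Stand-in for a successful access. */
void clean_access(void *arg)
{
    *(int *)arg = 42;
}
```

A caller would run install_sigbus_handler() once and then wrap each risky access in guarded_access(). Even this small sketch hints at the complexity: the handler only knows that some access failed, so the application still has to work out which data was lost and whether its in-memory state is consistent.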

Instead, the application dies, and the DRAM containing the error is returned to the system, where it is re-initialized before it can be used again. In this way, memory error handling is very simple. The server application dies. The memory is returned to the system. And typically, the application is restarted, and it begins again with fresh DRAM. 

With that background, let's compare how uncorrectable errors in PMEM differ from uncorrectable errors in DRAM. The events start out the same. When an uncorrectable error happens with persistent memory, a SIGBUS is sent to kill the application. 

But since persistent memory is, well, persistent, you may have already figured out that the error doesn't just go away because the application dies. If you restart the application, the most likely thing to happen next is that the application hits the exact same uncorrectable error in persistent memory and gets killed again with a SIGBUS. 

For this reason, the operating system keeps track of areas in persistent memory where there are known uncorrectable errors. Here you see an example of the ndctl command in Linux listing the known bad blocks in persistent memory. The libraries in the Persistent Memory Development Kit, or PMDK, automatically look at this information and will prevent a program from opening a persistent memory pool if it contains these errors. In this example, notice how PMDK's pmempool command indicates there are known errors. 
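As a rough illustration of what "keeping track of bad blocks" can look like from a program's point of view, the sketch below reads a bad-block list in the "start-sector count" text format that Linux exposes for block devices. It rests on assumptions: the path (something like /sys/block/pmem0/badblocks) depends on how the namespace is configured, and the names bad_range, read_bad_ranges, and file_has_bad_ranges are hypothetical. This is not how PMDK itself performs its check.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* One bad range as listed by the kernel, assumed to be in 512-byte sectors. */
struct bad_range {
    unsigned long long start;   /* first bad sector */
    unsigned long long count;   /* number of consecutive bad sectors */
};

/* Read up to max "start count" pairs from fp into out.
 * Returns the number of ranges stored. */
size_t read_bad_ranges(FILE *fp, struct bad_range *out, size_t max)
{
    size_t n = 0;
    while (n < max &&
           fscanf(fp, "%llu %llu", &out[n].start, &out[n].count) == 2)
        n++;
    return n;
}

/* Returns 1 if the list at path has at least one entry, 0 if it is
 * empty, and -1 if the file cannot be opened. */
int file_has_bad_ranges(const char *path)
{
    struct bad_range first;
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = read_bad_ranges(fp, &first, 1);
    fclose(fp);
    return n > 0 ? 1 : 0;
}
```

For example, file_has_bad_ranges("/sys/block/pmem0/badblocks") would report whether that device currently has recorded errors, if that path exists on the system. In practice, letting PMDK or the ndctl tool interpret this information is the simpler route.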

Putting all this information together, the simplest way for an application developer to handle memory errors is to let the application die when it gets a SIGBUS. This avoids the complicated programming of trying to handle SIGBUS at runtime. On restart, the application can detect that the persistent memory pool contains errors using PMDK and can repair the data during application initialization. 
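The restart-time approach can be sketched as a small initialization step. The sketch assumes that something else, such as a failed pool open through PMDK, has already determined whether the pool contains errors, and it passes that result in as the pool_has_errors flag. The repair shown is a plain file copy from a backup, and the names restore_from_backup and prepare_pool are hypothetical. Whether overwriting a damaged file also clears the recorded bad blocks depends on the platform and file system, and the sketch does not address that.

```c
#include <stdio.h>

/* Overwrite pool_path with the contents of backup_path.
 * Returns 0 on success, -1 on any I/O failure. */
int restore_from_backup(const char *backup_path, const char *pool_path)
{
    FILE *in = fopen(backup_path, "rb");
    if (in == NULL)
        return -1;              /* no backup: leave the pool untouched */

    FILE *out = fopen(pool_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[65536];
    size_t n;
    int rc = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    if (fclose(out) != 0)
        rc = -1;
    fclose(in);
    return rc;
}

/* Startup step: decide whether the pool can be used as it is.
 * Returns 0 if the pool was already clean, 1 if it was restored from
 * the backup, and -1 if it has errors and could not be restored. */
int prepare_pool(const char *pool_path, const char *backup_path,
                 int pool_has_errors)
{
    if (!pool_has_errors)
        return 0;
    return restore_from_backup(backup_path, pool_path) == 0 ? 1 : -1;
}
```

An application would call prepare_pool() before opening the pool for normal use and refuse to start if it returns -1. All of the recovery logic then runs once at startup, with no signal handler involved.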

For many applications, this repair can be as simple as reverting to a backup error-free copy of the data. You can see application developers are faced with some interesting choices. But it isn't hard to get started by using the simplest, most common techniques initially, and only bringing more complexity into your program if it turns out to be necessary. 

See the links provided for example programs and more documentation, tutorials, and videos on persistent memory programming. Don't forget to like this video and subscribe. Thanks for watching.