Applications use files to keep data from one run to the next. These files meet several requirements:
- They survive after the process exits.
- They can be read by a different program on a different system.
- They can be backed up and restored.
- They can be accessed by several processes simultaneously.
- The OS can cache portions of the file, start reading file data from the disk before a thread requests it, and allow the threads to continue while the file data is written to disk.
High-capacity, non-volatile memory devices make it possible to store data more effectively than using a disk-based file system. The programming effort to take advantage of these memory devices varies from almost none to significant depending on which of the above requirements must be met.
You can largely eliminate the time it takes to move data to or from media by replacing a solid-state drive (SSD) or hard disk drive (HDD) with a non-volatile memory device; however, OS overhead remains.
This replacement does not require any changes to most applications, because almost all applications let you specify file location.
You can eliminate the time to transition from user space to the OS if a thread memory-maps the file; however, the time to convert the file’s content to its in-memory representation remains.
Memory-mapped files on an SSD or HDD also incur a delay to load the data when a page is first touched. Memory-mapped files in non-volatile memory avoid that load delay, because the page is already in memory.
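As a minimal sketch of the memory-mapping approach described above, the following Python fragment maps a small file and updates an 8-byte counter in place. The file layout and the names `make_counter_file` and `increment_counter` are assumptions made for illustration; they do not come from any particular application. Note that the code still converts between the stored bytes and a Python integer, which is the conversion cost that memory-mapping alone does not remove.

```python
import mmap
import os
import tempfile


def make_counter_file():
    """Create a hypothetical 8-byte counter file initialised to zero."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write((0).to_bytes(8, "little"))
    return path


def increment_counter(path):
    """Map the file and bump the 8-byte little-endian counter at offset 0."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file. Reads and writes below touch the
        # mapping directly instead of going through read()/write() calls.
        with mmap.mmap(f.fileno(), 0) as view:
            # Conversion from the in-file bytes to the in-memory value.
            value = int.from_bytes(view[0:8], "little") + 1
            view[0:8] = value.to_bytes(8, "little")
            view.flush()  # ask the OS to write the modified pages back
            return value
```

Calling `increment_counter` twice on the same file returns 1 and then 2, because the second call sees the value that the first call left in the file.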
Converting the data has benefits that offset the conversion time overhead. The in-file representation can be more compressed than the in-memory representation, allowing more data to fit on the device. Applications accessing the data can use their own data structures, and they can change them without affecting the in-file representation. Lastly, it may be possible to protect the in-file representation from data corruption by using OS and programming language memory protection capabilities, even though the application is writing the in-memory representation.
Most languages allow applications to use I/O operations to read from and write to strings rather than files. This makes it easy to convert data structures into strings that hold the in-file representation, and to keep those strings in persistent memory instead of in a file.
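The string-based I/O idea can be sketched in Python with `io.StringIO`, which accepts the same write calls as a file. JSON is used here only as an example of an in-file representation, and the function names are hypothetical.

```python
import io
import json


def to_in_file_form(record):
    """Serialise a dict with ordinary file-style writes into a string buffer."""
    buffer = io.StringIO()
    json.dump(record, buffer, sort_keys=True)  # same call used for real files
    return buffer.getvalue().encode("utf-8")


def from_in_file_form(data):
    """Rebuild the in-memory structure from the stored bytes."""
    return json.load(io.StringIO(data.decode("utf-8")))
```

The bytes returned by `to_in_file_form` could then be copied into a memory-mapped region. The application remains free to change its in-memory structures as long as these two functions keep producing and accepting the same stored form.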
You can eliminate the conversion overhead if the memory-mapped data is in the representation that the threads use; however, this comes at a high cost that can be avoided only by deliberately restricting the data structures that can be so mapped.
If the data structures contain memory addresses, the targeted entity must reload at the same address. Accomplishing this can be difficult. The address may be put there by the compiler rather than by the source code (the implementation of virtual functions in C++ does this). The address may be put there by a library (C++ STL containers do this). The targeted entity may be in a different file, or may be the address of a function or vtable in the program code (the C++ compiler generates code that does this).
You must make an early design decision. Choose one or more of the following tactics:
- Make sure the entities are loaded in the right place, and cancel if they are not.
- Do not use addresses in the in-file data representation.
- Relocate the addresses before using them.
You can often avoid addresses by using indices and maintaining an index-to-address table.
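To illustrate the index-based tactic, here is a small sketch of a linked list whose nodes refer to each other by index rather than by address. The node layout (a 64-bit value followed by a 32-bit index) and the names `NODE`, `END`, `write_list`, and `read_list` are assumptions chosen for this example. A `bytearray` stands in for the mapped region.

```python
import struct

# Hypothetical node layout: value (int64), index of the next node (int32).
NODE = struct.Struct("<qi")
END = -1  # index value that marks the end of the list


def write_list(buffer, values):
    """Store values as a linked list whose links are indices, not addresses."""
    for i, value in enumerate(values):
        next_index = i + 1 if i + 1 < len(values) else END
        NODE.pack_into(buffer, i * NODE.size, value, next_index)


def read_list(buffer, head=0):
    """Follow the index links starting at node `head`."""
    out = []
    index = head
    while index != END:
        # The byte offset is derived from the index at the point of use.
        value, index = NODE.unpack_from(buffer, index * NODE.size)
        out.append(value)
    return out
```

Because each link is resolved relative to the start of the buffer, the list reads back correctly from a copy of the bytes placed anywhere in memory, which is the property that raw addresses lack.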
There is little difference between sharing a memory-mapped file between processes and sharing a data structure between threads. In both situations the code must cope with the challenges of memory access ordering and caching, cross-process locks, and addresses.
For data to survive a catastrophic failure of a portion of the hardware, of the software accessing it, or of both, two things are required:
- Some consistent, adequate representation of the data to survive on the remaining hardware
- Isolation of the remaining hardware from the failing portion
So you must decide which failures you need to survive; then you must use hardware and software to create firewalls that adequately isolate data from the failures. If your data must survive the Vogons demolishing Earth, you need the data replicated somewhere else.
The big questions for most users are what should happen if:
- Your application goes awry and corrupts the data?
- The memory device is no longer available?
- The process is killed?
An application that goes awry is the most challenging scenario because it is difficult to distinguish between good and bad behavior. Coping with this scenario requires timely backups of the prior state. Similarly, coping with loss of the memory device requires timely backups.
That leaves the simplest scenario: the application simply stops writing. For example, an application modifying a memory-mapped file cannot write to media because of a hardware failure, or an application is interrupted during a series of writes. Because the compiler and the memory subsystem both reorder writes, and writes can be interrupted at any step, coding writes in a specific order is not enough to prevent a problem.
You can solve this problem using fence or barrier instructions and the extensively studied (especially within the database community) concept of a transaction. Use undo and/or redo logs to compensate for the lost writes when the file is next mapped.
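The following is a minimal sketch of a redo log, under several stated assumptions: the data and the log live in ordinary files, a single update is in flight at a time, and `os.fsync` serves as the ordering point. On real persistent memory the ordering point would be a cache flush followed by a fence rather than `fsync`. The record layout and all function names are hypothetical.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical log record header: offset, payload length, CRC32 of payload.
HEADER = struct.Struct("<QII")


def log_update(log_path, offset, payload):
    """Step 1: record the intended write and force it to media first."""
    record = HEADER.pack(offset, len(payload), zlib.crc32(payload)) + payload
    with open(log_path, "wb") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())  # the log is durable before the data changes


def apply_update(data_path, offset, payload):
    """Step 2: perform the write in place and force it to media."""
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())


def clear_log(log_path):
    """Step 3: the update is complete, so empty the log."""
    with open(log_path, "wb") as log:
        os.fsync(log.fileno())


def recover(data_path, log_path):
    """Run when the data is next opened; replay a complete log record."""
    if not os.path.exists(log_path):
        return False
    with open(log_path, "rb") as log:
        raw = log.read()
    if len(raw) < HEADER.size:
        return False  # nothing was logged, or the header itself was torn
    offset, length, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return False  # torn record: step 2 never started, data is intact
    apply_update(data_path, offset, payload)  # replaying twice is harmless
    clear_log(log_path)
    return True


def simulate_crash_and_recover():
    """Log an update, 'crash' before applying it, then recover."""
    folder = tempfile.mkdtemp()
    data_path = os.path.join(folder, "data.bin")
    log_path = os.path.join(folder, "redo.log")
    with open(data_path, "wb") as f:
        f.write(b"AAAAAAAA")
    log_update(log_path, 2, b"xyz")  # the process stops here
    recover(data_path, log_path)     # next start-up replays the log
    with open(data_path, "rb") as f:
        return f.read()
```

The design relies on the log record being durable before the data is touched. If the process stops during step 1, the checksum fails and the old data is still intact. If it stops during step 2, replaying the record completes the write.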
The previous article, Memory Performance in a Nutshell, explained how volatile memory performs. This article shows how to get big improvements for your non-volatile data. The next article, Hardware and Software Approach for Using NUMA Systems, describes the steps to modify your application to get these improvements.
About the Author
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. He is inordinately fond of his three daughters – #1 a high school math teacher, #2 an ob-gyn doctor, and #3 a zoo keeper specializing in endangered birds.