Building Storage Right with DAOS and Intel Optane PMem 200

ID 673473
Updated 12/17/2020



Have you seen the early performance numbers for the Intel® Optane™ persistent memory 200 series with the newest Intel® Xeon® architecture? More about the actual numbers later. I've been thinking recently about the importance of Optane technology for the storage industry in general, and for DAOS in particular. Let’s begin with a little bit of DAOS development history. And if you've never heard the acronym “DAOS” from Intel, now is the right time to read up on the basics of DAOS (Distributed Asynchronous Object Storage).

Over eight years ago, the HPC storage team at Intel came up with a brilliant idea to respond to challenges they found in Lustre scalability. Being experienced with mass-scale deployments, they could clearly see the limits of file system performance scalability for big installations. These limits came mainly from the file system software stack rather than from the storage media technology of that time. The fundamental limitations of legacy storage software stacks became even more obvious later, when much faster SSD options came out. There were a number of bottlenecks in the legacy solutions, but the chief ones came from POSIX and block-based I/O.

POSIX takes a very pessimistic approach to ensuring data consistency, proactively locking in any situation where a conflict might occur. Imagine a user opening a million files for writing, and the number of locks that would be needed to preserve file system consistency. This pessimistic approach just does not scale, so it became necessary to consider other ways to provide data consistency and unlock the file system for future scalability. DAOS instead borrows techniques from the database space to offer an optimistic approach to data consistency, eliminating the need for so much locking.
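To make the contrast concrete, here is a minimal sketch of optimistic, version-checked commits as commonly used in databases. It is not DAOS code and does not show how DAOS implements consistency; the names `VersionedStore`, `commit`, and `increment` are hypothetical and exist only for this illustration.

```python
import threading


class VersionedStore:
    """Toy key-value store using optimistic, version-checked commits."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        # Held only for the brief commit step, never while the caller computes.
        self._commit_lock = threading.Lock()

    def read(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        """Apply the write only if nobody committed since our read."""
        with self._commit_lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # conflict detected: caller re-reads and retries
            self._data[key] = (current_version + 1, new_value)
            return True


def increment(store, key):
    """Read-modify-write that retries on conflict instead of locking up front."""
    retries = 0
    while True:
        version, value = store.read(key)
        if store.commit(key, version, (value or 0) + 1):
            return retries
        retries += 1
```

The point of the sketch is that work proceeds without holding a lock, and a conflict costs a retry only when it actually happens, rather than a lock on every access where it merely might.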

Similarly, all persistent I/O up to this point has been done on media that we write to in large blocks. This poses a real performance dilemma for I/Os smaller than the block size, including file system metadata: small I/Os that share a block cause more locking and serialization of activity, and less parallelism. Breaking through this bottleneck required innovation not only in the software stack, but also in the underlying storage media.
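A rough way to see the cost of small I/O on block media is to count how many whole blocks a small write touches. The sketch below is illustrative arithmetic only; the 4096-byte block size and the function name `block_write_cost` are assumptions, not values taken from any particular device or from DAOS.

```python
def block_write_cost(offset, length, block_size=4096):
    """Return (blocks_touched, amplification) for a write on block media.

    Assumes the device can only write whole blocks, so every touched block
    is rewritten in full. block_size=4096 is an assumed example value.
    """
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    blocks_touched = last_block - first_block + 1
    bytes_written = blocks_touched * block_size
    return blocks_touched, bytes_written / length


# A 64-byte metadata update inside one block rewrites the whole block.
print(block_write_cost(100, 64))
# A 16-byte update that straddles a block boundary touches two blocks.
print(block_write_cost(4090, 16))
```

Any other small update that lands in the same block has to be serialized against this one, which is the locking and lost parallelism described above.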

Breaking through these barriers would never have been possible without storage hardware innovation (the transition from SATA to NVMe and the appearance of Storage Class Memory (SCM) products) together with coordinated software stack innovation: new ways to access data through new interfaces such as key-value, HDF5, MPI-IO, and a Python API, alongside a legacy POSIX interface. Kelsey Prantis, Senior Software Engineering Manager of the DAOS team at Intel, writes about that in her blog.

Intel Optane Persistent Memory (PMem) is a fundamental technology that DAOS builds upon. DAOS is designed from the ground up for performance. No trade-offs: no legacy layers in hardware, no legacy in software. Just think of a chef who cooks a great meal from only the best ingredients. You pick great hardware (PMem and NVMe SSDs) and great software (the PMDK and SPDK user-space libraries, which talk to the media directly), then design a high-efficiency software stack with rich functionality on top, attached over an RDMA-enabled fabric.

“Why is PMem required for DAOS and can't be substituted?” is a question I hear quite often. There are multiple reasons, and Kelsey Prantis covers several of them in her blog. PMem serves as the shared store for metadata and as the small I/O tier. Metadata access is very fine-grained, and moving it from block storage to cache-line access simplifies its operations and removes the need to keep an active DRAM buffer. Small I/O, on the other hand, is stored in PMem on writes, as determined by the DAOS policy engine. This allows us to optimize bulk data writes to NVMe SSDs for better SSD performance (larger block sizes deliver better bandwidth) and better SSD endurance, which also depends on the write pattern. On the customer side, this benefit can mean using less expensive SSD grades, such as moving from high-endurance SSDs to mid-endurance or standard-endurance drives, or even moving between technologies, such as from TLC-based NAND SSDs to QLC SSDs. This follows from the nature of NAND media design: writing in larger blocks substantially increases the total amount of data that can be written to the media over its lifetime. Read more about that in my endurance blog, where I go over those details.
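The small I/O tiering idea can be sketched as a simple size-based routing rule. This is a toy illustration, not the DAOS policy engine: the 4 KiB threshold and the names `route_write` and `summarize` are assumptions made for this example.

```python
SMALL_IO_THRESHOLD = 4096  # bytes; assumed example value, not a DAOS setting


def route_write(size_bytes, threshold=SMALL_IO_THRESHOLD):
    """Pick a tier for one write: small writes to PMem, bulk writes to NVMe."""
    return "pmem" if size_bytes < threshold else "nvme"


def summarize(write_sizes, threshold=SMALL_IO_THRESHOLD):
    """Total the bytes each tier would absorb for a list of write sizes."""
    totals = {"pmem": 0, "nvme": 0}
    for size in write_sizes:
        totals[route_write(size, threshold)] += size
    return totals


# Mixed workload: a few small updates plus one 1 MiB bulk write.
print(summarize([512, 100, 4096, 1 << 20]))
```

With a rule like this, the NVMe SSDs only ever see writes at or above the threshold, which is the write pattern that favors both bandwidth and endurance.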

And for improved SSD endurance and lower cost of bulk storage, we should thank PMem. The final observation about the need for PMem in the DAOS design is the ability to perform RDMA directly into the PMem region. That benefits the way a DAOS client makes a transaction to the DAOS server: it takes fewer CPU cycles, since no memory copy is required. It's a good demonstration of how code can be more efficient with better hardware.

All primary DAOS operations involve PMem interaction, which sits on the critical performance path. So, can an incremental hardware performance improvement in a next-generation product translate easily into a performance improvement for the software stack? Yes, it can. How can we measure this? Well, for HPC storage we have a good set of tools designed to stress various aspects of it, such as metadata, file I/O, chunk size, and scalability across multiple clients. IOR and MDTEST are two commonly used benchmarks for this. They differ by design, and a complete storage performance assessment often requires both. Accordingly, the IO500 storage performance list is based on the combination of those benchmarks. DAOS has already shown outstanding performance with them, winning the #1 ranking in the IO500 list at ISC'20 using the Intel Optane PMem 100 series.

MDTEST is designed to stress metadata performance. It’s quite common in HPC to see parallel access to many files being opened for reads, writes, attribute updates, creation, or deletion. At large scale, those operations can create a significant bottleneck in file system performance, as they typically cause locking (remember the motivation for DAOS described earlier?). MDTEST measures how quickly you can serve those specific operations. IOR, on the other hand, measures throughput. Unlike MDTEST, IOR deals with data I/O: it measures the actual bandwidth achieved when many clients read and write multiple files simultaneously. For DAOS performance analysis with PMem, we considered both benchmarks critically important, as they are representative of overall file system performance and provide clear guidance recognized by the industry.
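The two benchmarks report different kinds of numbers: an operation rate for metadata and a bandwidth for data. The sketch below only shows how such figures are derived from raw counts and elapsed time. It is not part of MDTEST or IOR, the function names are hypothetical, and the inputs are made-up illustrative values rather than measurements.

```python
def metadata_rate(operations, seconds):
    """Metadata-style metric: completed operations per second."""
    return operations / seconds


def bandwidth_gib_s(bytes_moved, seconds):
    """Throughput-style metric: GiB moved per second."""
    return bytes_moved / seconds / 2**30


# Illustrative inputs only: one million file creates in 4 s,
# and 8 GiB written in 2 s.
print(metadata_rate(1_000_000, 4.0))
print(bandwidth_gib_s(8 * 2**30, 2.0))
```

A system can score well on one metric and poorly on the other, which is why both are needed for a complete picture.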

Now, let's look at the data shared at Intel Memory and Storage Moment 2020. For the first time, Intel shared preliminary performance numbers comparing a DAOS instance on the 2nd Generation Intel Xeon Scalable architecture with the Intel Optane PMem 100 series against one using the Intel Optane PMem 200 series in a brand-new system design. It's worth noting how the extra performance of the component translates into better overall performance under the MDTEST and IOR scenarios. In this early testing, DAOS saw an amazing 58% improvement in write bandwidth just from upgrading to the next-generation platform, with no optimizations to DAOS for the new hardware. This shows great promise for what the Intel Optane PMem 200 series will enable in performance for DAOS, as well as for many other applications. For more information on the performance of DAOS with the Intel Optane PMem 200 series, please see the following video, where Kelsey Prantis shares more details of our early testing.

Goto Video at