DirectStorage 1.1 for Intel GPUs

Published: 11/04/2022

By Sreenivas Kothandaraman, Hisham Chowdhury, and Allen Hux

Over the past two years, Intel worked closely with Microsoft* to co-engineer and optimize DirectStorage, a new game asset transport and compression technology. The co-design with Microsoft enables the DirectStorage 1.1 runtime (see Figure 1) to discover and invoke highly optimized, driver-resident GPU decompression for Intel GPUs, including the recent Intel® Arc™ family of GPUs.

The Intel® Arc™ graphics driver 101.3793 includes the DirectStorage 1.1 optimization for the Intel Xe architecture. It offers immediate value, in the form of reduced load times, for DirectStorage-enabled workloads and games on systems equipped with Non-Volatile Memory Express (NVMe) SSDs. For the long term, and with a larger install base of NVMe storage systems, DirectStorage redefines the way we look at asset streaming, and opens up a whole new world of exciting advances for the next generation of games. The early benefits of DirectStorage incentivize the adoption of NVMe, and bring us closer to the longer-term vision.

Figure 1. DirectStorage stack on Intel® Architecture.

Software Advances with Modern Hardware

As mentioned in the Microsoft preview blog, DirectStorage takes advantage of the improvements of modern NVMe devices, and the power of GPUs, to improve load times for games. DirectStorage for Windows* makes file loading an integral component of the graphics API. It greatly reduces the complexity of loading disk assets into DirectX* 12 resources while providing high performance and, on Windows* 11, uses a lower-overhead I/O path to reduce CPU usage. DirectStorage provides industry-standard GPU decompression on any DirectX* 12-capable device.

Intel and Microsoft made sure that the DirectStorage 1.1 API is not simply enabled, but also highly performant and efficient on Intel GPUs. For more details about the DirectStorage API, see the Microsoft DirectStorage API Reference listed under Resources.

Below is a performance comparison running a highly optimized DirectStorage sample. The sample measures the total loading time, from compressed data reads on the storage system to availability of uncompressed data to the GPU.

Figure 2. Demo running on the Intel® Core™ i9-12900K CPU; the measured bandwidth is 7.88 GB/s.

Figure 3. Demo running with DirectStorage 1.1 GPU decompression on Intel® Arc™ A770 16GB; the bandwidth increased to 21.67 GB/s.

As shown, DirectStorage 1.1, with the Intel-optimized software stack, provides a 2.7x improvement over a non-GPU accelerated path for the above workload running on a 16-core CPU. The performance benefit comes from highly optimized GPU decompression, and more efficient asset transfers to the GPU.

High-performance Assets Streaming

Loading all the assets of a game level with DirectStorage is simple enough, but emerging graphics workloads load assets constantly, treating high-speed storage as a massive read-only, last-level cache. Streaming technology enables a scene containing hundreds of gigabytes of assets to be supported by 1/1000th as much physical memory. To explore this usage, and understand techniques to optimize performance, Intel built Expanse. It’s a simple demo of a virtual texturing system that now supports GPU decompression.

Figure 4. A scene from Expanse showing nearly 1,000 textures, each over 350 MB in size, uncompressed, using about 100 MB of physical GPU memory.
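The arithmetic behind Figure 4 can be sketched as follows. This is a minimal illustration that uses the rounded figures from the caption (about 1,000 textures, about 350 MB each, about 100 MB resident); the struct and function names are hypothetical and do not come from Expanse.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a streamed scene, in megabytes.
struct SceneFootprint {
    std::uint64_t textureCount;          // number of textures in the scene
    std::uint64_t megabytesPerTexture;   // uncompressed size of each texture
    std::uint64_t residentMegabytes;     // physical GPU memory actually in use
};

// Total uncompressed ("virtual") texture data referenced by the scene.
inline std::uint64_t VirtualMegabytes(const SceneFootprint& s) {
    return s.textureCount * s.megabytesPerTexture;
}

// How many times larger the virtual data set is than the resident set.
inline double VirtualToResidentRatio(const SceneFootprint& s) {
    assert(s.residentMegabytes > 0);
    return static_cast<double>(VirtualMegabytes(s)) /
           static_cast<double>(s.residentMegabytes);
}
```

With the rounded caption values, `SceneFootprint{1000, 350, 100}` gives 350,000 MB (about 350 GB) of virtual texture data and a ratio of 3,500 to 1, which is in line with the "1/1000th as much physical memory" order of magnitude described above.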

Expanse uses Direct3D* 12 (D3D12) sampler feedback to determine which tiles of each texture (each texture is a D3D12 reserved resource) must be uploaded to render the scene correctly. Memory backing parts of a texture that aren’t visible (back-facing or occluded) is quickly recycled for use by newly visible tiles.
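One way to picture the recycling step is as a set difference between the tiles that are currently resident and the tiles that feedback says are required. The sketch below is an assumption-laden illustration, not Expanse's implementation: tile IDs are plain integers, both lists are assumed sorted and free of duplicates, and `ResidencyUpdate` and `DiffResidency` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical result of comparing resident tiles against required tiles.
struct ResidencyUpdate {
    std::vector<std::uint32_t> toLoad;   // required but not yet resident
    std::vector<std::uint32_t> toEvict;  // resident but no longer required
};

// Both inputs must be sorted ascending and contain no duplicates.
inline ResidencyUpdate DiffResidency(const std::vector<std::uint32_t>& resident,
                                     const std::vector<std::uint32_t>& required) {
    assert(std::is_sorted(resident.begin(), resident.end()));
    assert(std::is_sorted(required.begin(), required.end()));

    ResidencyUpdate update;
    // Tiles that became visible: these need upload requests.
    std::set_difference(required.begin(), required.end(),
                        resident.begin(), resident.end(),
                        std::back_inserter(update.toLoad));
    // Tiles that are no longer visible: their memory can be recycled.
    std::set_difference(resident.begin(), resident.end(),
                        required.begin(), required.end(),
                        std::back_inserter(update.toEvict));
    return update;
}
```

In this model, the memory freed by `toEvict` is what gets reused for the tiles in `toLoad`, so the resident set stays small even as the camera moves through a very large scene.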

The texturing system details are somewhat complicated, but the final operation is simple and consistent: enqueue a DirectStorage request for each required tile. Then, after a few requests, call EnqueueSignal() to signal a fence, and call Submit() to submit the outstanding work. Because file accesses are simple and confined to a single routine, Expanse is an easy application to profile and optimize. In benchmark mode, Expanse acts as a proxy for future, demanding asset-streaming applications.
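The enqueue, signal, submit pattern described above can be sketched with a stand-in queue. `MockQueue` below is not the real DirectStorage queue interface: it only mirrors the three call names from the text and counts calls, so the batching logic can be compiled and run without the DirectStorage SDK. The real calls take request descriptors and a D3D12 fence, which are omitted here. `UploadTiles` and the integer tile IDs are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a DirectStorage queue (an assumption for illustration only).
// It records how many times each call was made.
struct MockQueue {
    std::size_t requestsEnqueued = 0;
    std::size_t signalsEnqueued = 0;
    std::size_t submits = 0;
    std::uint64_t lastFenceValue = 0;

    void EnqueueRequest(std::uint32_t /*tileId*/) { ++requestsEnqueued; }
    void EnqueueSignal(std::uint64_t fenceValue) {
        ++signalsEnqueued;
        lastFenceValue = fenceValue;
    }
    void Submit() { ++submits; }
};

// Enqueue one request per required tile, then one fence signal and one submit
// for the whole batch. Returns the fence value to wait on, or 0 when there
// was nothing to load (in which case no submit is issued).
inline std::uint64_t UploadTiles(MockQueue& queue,
                                 const std::vector<std::uint32_t>& tileIds,
                                 std::uint64_t& nextFenceValue) {
    assert(nextFenceValue != 0);  // 0 is reserved to mean "no work submitted"
    if (tileIds.empty()) {
        return 0;
    }
    for (std::uint32_t id : tileIds) {
        queue.EnqueueRequest(id);
    }
    const std::uint64_t fenceValue = nextFenceValue++;
    queue.EnqueueSignal(fenceValue);
    queue.Submit();
    return fenceValue;
}
```

Keeping the whole sequence in one routine, as in this sketch, matches the point made above: a single place to instrument makes the application easy to profile.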

The team increased the bandwidth of Expanse by up to 2x through refinement of its heuristics guiding DirectStorage API calls. Below are some of the steps taken, and key findings:

  1. Set the capacity for file queues to maximum (DSTORAGE_MAX_QUEUE_CAPACITY).
  2. Many parts of the system benefit from having more work in flight, including the disk.  Maximizing the number of requests per submit, therefore, is critical.  Techniques used for Expanse include:
    1. Instrument your code to count the number of requests and submits. Work on heuristics in your code to maximize this ratio.
    2. Add instrumentation to see the number of submits for each frame. You should address cases where multiple submissions are occurring in a single frame. You may need to call Submit more frequently if latency is critical; for example, to prioritize foreground versus background assets.
    3. Record all your application’s DirectStorage requests and submits to a file. Search for cases where submits occur with few or no requests, which indicates errors in your heuristics.
    4. Create a mechanism to play back the recorded requests and submits, independent of the application (and any other 3D or compute shaders). This helps fine-tune the staging buffer size for peak bandwidth. For example, 128 MB provided a good balance of performance and GPU memory size for Expanse. The staging buffer size must be among the first things set by the application, via SetStagingBufferSize().
  3. Try to keep request size (compressed, on disk) above 64 KB to reduce file overfetch, and improve efficiency within the DirectStorage CPU runtime and GPU shaders. Expanse uses 64 KB tiles today; future performance exploration will include loading larger texture regions (e.g., 2x2 tiles instead of single tiles) to reduce the overfetch that results from smaller, unaligned requests.
  4. Lower performance platforms may benefit from a smaller staging buffer, as they won’t be able to stream as quickly. Keep this in mind for platforms that are also likely to be memory constrained.
  5. The default setting for the number of submit threads is sufficient to saturate most storage devices. Intel does not recommend changing this value.
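The request and submit counting described in the steps above could look something like the following. This is a minimal sketch, assuming the application calls the hooks itself at the matching points; `SubmitStats` and its methods are hypothetical names and are not part of DirectStorage or Expanse.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instrumentation for the requests-per-submit heuristics.
class SubmitStats {
public:
    // Call once for every request the application enqueues.
    void OnRequest() { ++pendingRequests_; }

    // Call once for every submit; records how many requests it carried.
    void OnSubmit() {
        requestsPerSubmit_.push_back(pendingRequests_);
        submittedRequests_ += pendingRequests_;
        pendingRequests_ = 0;
        ++submitsThisFrame_;
    }

    // Call once at the end of every rendered frame.
    void OnFrameEnd() {
        if (submitsThisFrame_ > 1) {
            ++framesWithMultipleSubmits_;
        }
        submitsThisFrame_ = 0;
    }

    // The ratio to maximize: requests per submit, averaged over the run.
    double AverageRequestsPerSubmit() const {
        if (requestsPerSubmit_.empty()) {
            return 0.0;
        }
        return static_cast<double>(submittedRequests_) /
               static_cast<double>(requestsPerSubmit_.size());
    }

    // Number of submits that carried fewer than minRequests requests.
    std::size_t SubmitsBelow(std::size_t minRequests) const {
        assert(minRequests > 0);  // a threshold of 0 can never flag anything
        std::size_t count = 0;
        for (std::size_t n : requestsPerSubmit_) {
            if (n < minRequests) {
                ++count;
            }
        }
        return count;
    }

    std::size_t FramesWithMultipleSubmits() const {
        return framesWithMultipleSubmits_;
    }

private:
    std::vector<std::size_t> requestsPerSubmit_;
    std::size_t pendingRequests_ = 0;
    std::size_t submittedRequests_ = 0;
    std::size_t submitsThisFrame_ = 0;
    std::size_t framesWithMultipleSubmits_ = 0;
};
```

A nonzero `SubmitsBelow(1)` flags submits that carried no requests at all, and a high `FramesWithMultipleSubmits()` points at frames worth inspecting, unless the extra submits were made deliberately for latency-critical assets.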

Expanse includes all the instrumentation described above. It can output per-frame statistics, including number of requests and submits, and it includes a trace capture and playback system. The profiling results for a high-end platform containing an Intel® Core™ i9-12900 CPU and Intel® Arc™ A770 GPU show that Expanse (in benchmark mode) can average hundreds of tiles uploaded per frame, and sometimes thousands of requests for a single submit. With compressed assets, it is possible for the uncompressed bandwidth to exceed the theoretical performance of the disk, or the PCIe interface.
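The last point follows from simple arithmetic: the disk and bus carry compressed bytes, while the GPU receives uncompressed bytes. A minimal sketch, assuming decompression keeps pace with the incoming data; the function names and the example numbers below are illustrative and are not measurements from this article.

```cpp
#include <cassert>

// Uncompressed data rate delivered to the GPU when compressed data arrives at
// compressedGBps and expands by compressionRatio (uncompressed / compressed).
inline double UncompressedBandwidth(double compressedGBps, double compressionRatio) {
    assert(compressedGBps >= 0.0 && compressionRatio >= 1.0);
    return compressedGBps * compressionRatio;
}

// Compression ratio needed to reach a target uncompressed rate over a link
// that can carry at most linkGBps of compressed data.
inline double RequiredCompressionRatio(double targetUncompressedGBps, double linkGBps) {
    assert(linkGBps > 0.0);
    return targetUncompressedGBps / linkGBps;
}
```

For example, a hypothetical link that sustains 7.0 GB/s of compressed reads, carrying assets that compress 3:1, would deliver 21.0 GB/s of uncompressed data, three times what the link itself can move.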

New Game Design Paradigms

Intel envisions a future where fast storage is valued for more than its non-volatility, and is reckoned as the last level of a memory hierarchy, to be exploited through emerging game-design technologies that stream assets in steady state. This vision of steady state data-transport from storage to GPU local memory, while rendering, is in contrast to the traditional burst of data transport at the beginning of game levels today.

GPU decompression naturally competes for resources with rendering. Ideally, the work is complementary: if decompression is memory-bound and rendering is compute-bound, then decompression could be essentially free. In practice, the experience will depend on differences in platforms and software. For example, Expanse’s frame rate is not typically affected by GPU decompression, nor does frame rate noticeably affect bandwidth. However, in a targeted benchmark mode that tests the platform I/O performance, the frame rate is measurably affected by GPU decompression. Early data on, and investigation of, this tension in real workloads are extremely interesting, and will feed into hardware and software roadmaps for the years to come.

Resources

Expanse, a simple demo of a virtual texturing system that now supports GPU decompression.

Microsoft DirectStorage API Reference

Microsoft DirectStorage original announcement

Microsoft DirectStorage samples Repo

Microsoft DirectStorage announcement

DirectX* download page

Intel® Arc™ graphics landing page

Workloads and Configurations

DirectStorage GPU Decompression

Claim: Intel® Arc™ A770 graphics delivers up to 2.7x faster asset loading times compared to Intel® Core™ i9-12900K when using the new DirectStorage 1.1 API with GPU decompression acceleration.

GPU(s): Intel® Arc™ A770 16GB Graphics

System Configuration: Processor: Intel® Core™ i9-12900K, Asus ROG Maximus Z690 Hero, Discrete Graphics: Intel® Arc™ A770 16GB Limited Edition, Graphics Driver: ci-master-12647, Memory: 32GB (2x16GB) DDR5 @6000MHz, Storage: Samsung 980 Pro 1TB NVMe SSD, OS: Windows 11 Pro v21H2 Build 22621.675

Measurement: The DirectStorage technical demo measures asset load time from the storage system to GPU memory and reports bandwidth in GB/s. It compares GDeflate with GPU decompression to Zlib with CPU decompression. Performance numbers were run 5 times for each decompression method.

Measurement Period: Nov 3 - Nov 4, 2022

Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting https://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel® Core™, and Intel® Arc™ are trademarks of Intel Corporation in the U.S. and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

*Other names and brands may be claimed as the property of others.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.