This article describes two methods, the Flexible I/O (fio) tool and SPDK Perf, for evaluating the disk performance of NVMe SSDs. It summarizes common performance problems encountered when using the Storage Performance Development Kit (SPDK) or the kernel.
Test Disk Performance With Flexible I/O (fio) Tools
SPDK uses asynchronous input/output (I/O) with a polling work mode, so in performance tests it is usually compared with kernel asynchronous I/O (AIO). We describe how to use fio to evaluate the performance of kernel AIO and of two modes of the SPDK fio_plugin, NVMe and bdev.
Using Fio to evaluate kernel asynchronous I/O (AIO)
Fio supports multiple I/O engine modes, including AIO, which uses the ioengine libaio. When testing AIO, set the ioengine in the fio startup configuration file to be libaio. For AIO, I/O requests are sent to the appropriate queues, where they wait to be processed, so queue depth will affect disk performance. Therefore, when testing AIO, specify the corresponding queue depth (iodepth) according to the characteristics of the disk.
An example of fio configuration parameters for testing kernel AIO is:
[global]
ioengine=libaio
direct=1
rw=randrw
; rwmixread: 100 = 100% reads, 70 = 70% reads and 30% writes, 0 = 100% writes
rwmixread=100
thread=1
norandommap=1
time_based=1
runtime=300s
ramp_time=10s
bs=4k
iodepth=32
numjobs=1

[test]
filename=/dev/nvme0n1
- ioengine: specifies the I/O engine. For kernel AIO, set it to libaio
- direct: enables direct I/O (O_DIRECT), so I/O bypasses the system's page cache
- rw: the read/write pattern; randrw specifies mixed random reads and writes
- rwmixread: the percentage of read requests in a mixed read/write workload
- thread: runs fio jobs as threads rather than processes. Because the SPDK fio_plugin supports only thread mode, thread mode is normally used for the kernel test as well so that the results are comparable
- norandommap: fio does not track which blocks have already been accessed and simply picks a new random offset for each I/O, which avoids additional CPU usage
- time_based: runs the test for the full runtime, even if the whole device has been read or written before the time is up
- runtime: test duration
- ramp_time: how long the workload runs before statistics are collected, so that the unsteady start-up phase does not influence the results
- bs: I/O block size
- iodepth: queue depth
- numjobs: the number of workers
- filename: object to test
Note: For disk performance testing, bypass the system's page cache by setting direct=1, so that the measured performance is that of the real physical disk.
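The job file above can also be generated from a script, which helps when the same test is repeated for several read/write mixes. The following is a minimal Python sketch and is not part of fio or SPDK. The helper name render_aio_job and its default arguments are illustrative assumptions; the option values mirror the example configuration.

```python
# Sketch: render the kernel AIO job file for different read/write mixes.
# render_aio_job is a hypothetical helper, not part of fio.

def render_aio_job(rwmixread, filename="/dev/nvme0n1", iodepth=32, numjobs=1):
    """Return fio job-file text for a libaio mixed random read/write test."""
    if not 0 <= rwmixread <= 100:
        raise ValueError("rwmixread is a percentage and must be in 0..100")
    global_options = {
        "ioengine": "libaio",    # kernel AIO
        "direct": 1,             # O_DIRECT, bypass the page cache
        "rw": "randrw",
        "rwmixread": rwmixread,
        "thread": 1,
        "norandommap": 1,
        "time_based": 1,
        "runtime": "300s",
        "ramp_time": "10s",
        "bs": "4k",
        "iodepth": iodepth,
        "numjobs": numjobs,
    }
    lines = ["[global]"]
    lines += [f"{name}={value}" for name, value in global_options.items()]
    lines += ["", "[test]", f"filename={filename}"]
    return "\n".join(lines) + "\n"


# 100 = all reads, 70 = 70% reads and 30% writes, 0 = all writes.
for mix in (100, 70, 0):
    print(f"# job file for rwmixread={mix}")
    print(render_aio_job(mix))
```

Each rendered text can be saved to a file and passed to fio as the job file.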
The NVMe-based fio_plugin evaluates the performance of physical NVMe solid-state drives (SSDs) accessed through SPDK.
- Download and compile fio:
git clone https://github.com/axboe/fio
cd fio && git checkout fio-3.3
make
- Download and compile the SPDK:
git clone https://github.com/spdk/spdk
cd spdk && git submodule update --init
./configure --with-fio=/path/to/fio/repo <other configuration options>
make
Note: Because the fio_plugin relies on dependencies provided by fio, specify the fio directory when running the configuration script. Otherwise, by default, the fio_plugin is not compiled.
- To use the SPDK fio_plugin with fio, specify the plugin binary using LD_PRELOAD when running fio.
LD_PRELOAD=<path to spdk repo>/examples/nvme/fio_plugin/fio_plugin
- Set ioengine=spdk in the fio configuration file.
- When running fio, specify both the fio configuration file and the device PCI address information that SPDK can recognize with the additional parameter '--filename'. In general, fio_plugin supports two modes. One is the local NVMe* device, namely NVMe over PCIe*. The other is the remote NVMe device, namely NVMe over Fabrics.
NVMe over PCIe:
LD_PRELOAD=.../fio_plugin fio config.fio '--filename=trtype=PCIe traddr=0000.06.00.0 ns=1'
NVMe over Fabrics:
LD_PRELOAD=.../fio_plugin fio config.fio '--filename=trtype=RDMA adrfam=IPv4 traddr=192.0.0.1 trsvcid=4420 ns=1'
- When one core is used to test multiple disks, it is usually only necessary to set numjobs to 1 and to specify the disks through multiple filename parameters in the fio command (separated by spaces). For example, to test three disks simultaneously:
LD_PRELOAD=.../fio_plugin fio config.fio '--filename=trtype=PCIe traddr=0000.06.00.0 ns=1' '--filename=trtype=PCIe traddr=0000.07.00.0 ns=1' '--filename=trtype=PCIe traddr=0000.08.00.0 ns=1'
- For more information about the spdk ioengine, refer to the relevant parameter description using the following command:
LD_PRELOAD=.../fio_plugin fio --enghelp=spdk
- Alternatively, the absolute path of the fio_plugin binary can be specified directly in ioengine, so that LD_PRELOAD does not have to be set every time fio runs. Add ioengine=<path to spdk repo>/examples/nvme/fio_plugin/fio_plugin to the fio configuration file, and then run fio config.fio '--filename=trtype=PCIe traddr=0000.06.00.0 ns=1' to test.
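The command lines in the steps above follow a regular pattern, so they can be assembled by a script when several disks are tested. The sketch below is a hedged illustration in Python and is not part of SPDK or fio. The helper names and the plugin path /opt/spdk/... are hypothetical. It assumes PCI addresses are given in the usual colon form and rewrites them to the dotted form used in the '--filename' examples above.

```python
import shlex

# Sketch: build the fio command line for the NVMe-based fio_plugin.
# spdk_filename and fio_command are hypothetical helpers.

def spdk_filename(pci_addr, ns=1):
    """Format one PCIe device for '--filename', using dots instead of colons."""
    dotted = pci_addr.replace(":", ".")
    return f"trtype=PCIe traddr={dotted} ns={ns}"


def fio_command(job_file, pci_addrs, plugin_path):
    """Return a shell command string with one --filename argument per disk."""
    args = ["fio", job_file]
    args += [f"--filename={spdk_filename(addr)}" for addr in pci_addrs]
    quoted = " ".join(shlex.quote(arg) for arg in args)
    return f"LD_PRELOAD={shlex.quote(plugin_path)} {quoted}"


# Hypothetical install location of the SPDK repository.
PLUGIN = "/opt/spdk/examples/nvme/fio_plugin/fio_plugin"
print(fio_command("config.fio",
                  ["0000:06:00.0", "0000:07:00.0", "0000:08:00.0"],
                  PLUGIN))
```

The function only returns a string, so the command can be reviewed before it is run.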
The block device layer (bdev)–based fio_plugin sends I/O on the SPDK block device. For the NVMe-based fio_plugin, I/O is processed directly on the physical disk. Hence the biggest difference between the two is whether or not I/O goes through the bdev layer. Therefore, a bdev-based fio_plugin can evaluate the performance of the SPDK block device layer.
The compilation and installation of the bdev-based fio_plugin is exactly the same as the NVMe-based fio_plugin.
- To use the bdev-based fio_plugin with fio, specify the plugin binary using LD_PRELOAD when running fio.
LD_PRELOAD=<path to spdk repo>/examples/bdev/fio_plugin/fio_plugin
- Set ioengine=spdk_bdev in the fio configuration file.
- Specify the SPDK startup configuration file in the fio configuration file with the spdk_conf parameter, for example spdk_conf=<path to SPDK configuration file>.
The configuration of all bdevs is specified in the SPDK startup configuration file. For example:
[Malloc]
NumberOfLuns 1
LunSizeInMB 128
[Nvme]
TransportID "trtype:PCIe traddr:0000:82:00.0" Nvme0
RetryCount 4
TimeoutUsec 0
ActionOnTimeout None
AdminPollRate 100000
- When running fio, directly specify the name of the bdev to be tested by '--filename'. The example is:
LD_PRELOAD=.../fio_plugin fio config.fio '--filename=Nvme0n1'
- When testing multiple devices with the bdev-based fio_plugin, include the corresponding bdev configurations in the SPDK startup configuration file. When running fio, specify multiple filename parameters, separated by spaces. For example, to test the two devices Malloc0 and Nvme0n1 simultaneously:
LD_PRELOAD=.../fio_plugin fio config.fio '--filename=Nvme0n1' '--filename=Malloc0'
- Similarly, to get more information about the spdk_bdev ioengine, use the following command:
LD_PRELOAD=.../fio_plugin fio --enghelp=spdk_bdev
- As with the NVMe-based plugin, the absolute path of the fio_plugin binary can be specified directly in ioengine, so that LD_PRELOAD does not have to be set every time fio runs. Add ioengine=<path to spdk repo>/examples/bdev/fio_plugin/fio_plugin to the fio configuration file, and then run fio config.fio '--filename=Nvme0n1' to test.
Test Disk Performance With SPDK Perf
NVMe-based perf tool
After successful compilation of SPDK, the binary file of the perf tool can be found in the spdk/examples/nvme/perf/ directory. The method of using perf is:
perf -c <core mask> -q <I/O depth> -t <time in seconds> -w <io pattern type: write|read|randread|randwrite> -s <huge memory size in MB> -o <I/O size in bytes> -r <transport ID for local PCIe NVMe or NVMeoF>
For more information about perf parameters, use the command perf --help.
Perf supports both local NVMe devices and remote NVMe over Fabrics (NVMe-oF) devices. An example is:
NVMe over PCIe:
perf -q 32 -s 1024 -w randwrite -t 1200 -c 0xF -o 4096 -r 'trtype:PCIe traddr:0000:06:00.0'
NVMe over Fabrics:
perf -q 32 -s 1024 -w randwrite -t 1200 -c 0xF -o 4096 -r 'trtype:RDMA adrfam:IPv4 traddr:192.0.0.1 trsvcid:4420'
To test multiple disks simultaneously, add one -r option per device address. For example, one core testing three disks:
perf -q 32 -s 1024 -w randwrite -t 1200 -c 0x1 -o 4096 -r 'trtype:PCIe traddr:0000:06:00.0' -r 'trtype:PCIe traddr:0000:07:00.0' -r 'trtype:PCIe traddr:0000:08:00.0'
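The -c core mask in these examples is a bitmap with one bit per CPU core, so 0x1 selects core 0 and 0xF selects cores 0 to 3. The following Python sketch shows how such a mask and the repeated -r options could be assembled. The helper names core_mask and perf_args are hypothetical; the option letters and default values are taken from the examples above.

```python
import shlex

# Sketch: derive the core mask and build perf arguments for several disks.
# core_mask and perf_args are hypothetical helpers, not part of SPDK.

def core_mask(cores):
    """Return a hexadecimal mask with one bit set per CPU core index."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indexes must be non-negative")
        mask |= 1 << core
    return f"0x{mask:X}"


def perf_args(pci_addrs, cores, queue_depth=32, io_size=4096,
              workload="randwrite", seconds=1200, hugemem_mb=1024):
    """Return the perf argument list with one -r option per PCIe device."""
    args = ["perf", "-q", str(queue_depth), "-s", str(hugemem_mb),
            "-w", workload, "-t", str(seconds),
            "-c", core_mask(cores), "-o", str(io_size)]
    for addr in pci_addrs:
        args += ["-r", f"trtype:PCIe traddr:{addr}"]
    return args


# One core (core 0) testing three disks.
args = perf_args(["0000:06:00.0", "0000:07:00.0", "0000:08:00.0"], cores=[0])
print(" ".join(shlex.quote(arg) for arg in args))
```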
Using perf to evaluate kernel asynchronous I/O (AIO)
For testing kernel AIO, the usage is the same as for testing of the SPDK driver. Add the device name after the perf command. For example:
perf -q 32 -s 1024 -w randwrite -t 1200 -c 0xF -o 4096 /dev/nvme0n1
Bdev-based perf tool
After successfully compiling SPDK, the bdevperf binary can be found in the spdk/test/bdev/bdevperf/ directory. Here’s how to use bdevperf:
bdevperf -c <config> -q <I/O depth> -t <time in seconds> -w <io pattern type: write|read|randread|randwrite> -s <huge memory size in MB> -o <I/O size in bytes> -m <core mask>
For more information about the parameters, use the command bdevperf --help.
Among the parameters, -c is the configuration file of bdevperf. The bdev devices to be tested are specified in the configuration file. For example, to test a local NVMe device, the bdevperf configuration file is:
[Nvme]
TransportID "trtype:PCIe traddr:0000:82:00.0" Nvme0
RetryCount 4
TimeoutUsec 0
ActionOnTimeout None
AdminPollRate 100000
An example of the corresponding bdevperf command is:
bdevperf -q 32 -s 1024 -w randwrite -t 1200 -o 4096 -m 0xF -c bdevperf.conf
To test multiple disks with bdevperf, simply add the information for each disk to the SPDK startup configuration file. For example, to test three disks simultaneously:
[Nvme]
TransportID "trtype:PCIe traddr:0000:82:00.0" Nvme0
TransportID "trtype:PCIe traddr:0000:83:00.0" Nvme1
TransportID "trtype:PCIe traddr:0000:84:00.0" Nvme2
RetryCount 4
TimeoutUsec 0
ActionOnTimeout None
AdminPollRate 100000
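Because the [Nvme] section grows by one TransportID line per disk, it can be generated from a list of PCI addresses. The following is a minimal Python sketch under stated assumptions: the helper name nvme_section is hypothetical, the tuning values are copied from the example, and the timeout key is written as ActionOnTimeout.

```python
# Sketch: generate the [Nvme] section of the SPDK startup configuration file.
# nvme_section is a hypothetical helper, not part of SPDK.

def nvme_section(pci_addrs):
    """Return [Nvme] section text with controllers named Nvme0, Nvme1, ..."""
    lines = ["[Nvme]"]
    for index, addr in enumerate(pci_addrs):
        lines.append(f'TransportID "trtype:PCIe traddr:{addr}" Nvme{index}')
    # Settings copied from the example configuration.
    lines += [
        "RetryCount 4",
        "TimeoutUsec 0",
        "ActionOnTimeout None",
        "AdminPollRate 100000",
    ]
    return "\n".join(lines) + "\n"


print(nvme_section(["0000:82:00.0", "0000:83:00.0", "0000:84:00.0"]))
```

The output can be written to bdevperf.conf and passed to bdevperf with -c.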
Why is the performance obtained by perf better than that of fio?
The biggest difference between the two tools is that the fio_plugin only integrates SPDK into the Linux* fio tool as an I/O engine, so that fio can test SPDK devices. Because of fio's own architecture, SPDK cannot be fully utilized: the application framework as a whole is still fio's rather than the SPDK application framework.
For example, fio uses the Linux thread model, and its threads are still scheduled by the kernel. Perf, in contrast, is a performance evaluation tool written specifically for SPDK. Not only is I/O sent through SPDK, but the underlying application framework is also designed for SPDK. For the thread model just mentioned, perf uses the thread model provided by DPDK, which binds each thread to a CPU core through CPU affinity so that it is no longer subject to kernel scheduling. The advantages of asynchronous, lock-free operation can therefore be fully exploited. This is why the performance measured by perf is higher than that measured by fio, especially when a single thread (single core) tests multiple disks at the same time. In the same situation, users are therefore advised to evaluate SPDK performance with the SPDK perf tool.
Why haven’t I seen the significant performance improvement described by SPDK documentation?
As explained in the previous question, performance results differ between tools. The most important factor, however, is the performance bottleneck of the drive itself. For example, the Intel® SSD DC P3700 Series with 2D NAND media has its own bottlenecks, so both the SPDK user-mode driver and the kernel driver are limited by the drive, and neither can deliver more I/O operations per second (IOPS) than the drive allows.
With a higher-performance SSD, for example the Intel® Optane™ DC SSD P4800X Series with Intel® Optane™ technology, the results show a big performance difference. The higher the performance of the drive, the more obvious the advantages of SPDK. This is the original design intention of SPDK: to be customized for high-performance NVMe SSDs.
What is the optimal iodepth and number of CPU cores for different NVMe SSDs?
Iodepth and the number of CPU cores are usually selected based on the characteristics of the NVMe SSD type. When evaluating drives with 2D NAND or 3D NAND media, a higher iodepth (128 or 256) usually gives the maximum disk performance. For the Intel® SSD DC P4500 Series, one CPU core may not reach maximum IOPS because of limitations of the drive's queue rather than the capability of the CPU core, so it is better to use two CPU cores to achieve the specification's maximum IOPS. For the Intel Optane DC SSD P4800X Series, a single core with a smaller iodepth is usually enough to reach maximum IOPS. At that point the upper limit of the drive has been reached: increasing iodepth further only increases latency, and IOPS no longer grows.
In summary, the evaluation parameters recommended for different NVMe SSDs are given below:
Intel® SSD DC P3700 series: numjobs=1, iodepth=128
Intel SSD DC P4500 series: numjobs=2, iodepth=256
Intel Optane DC SSD P4800X series: numjobs=1, iodepth=8/16/32
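The recommendations above can be kept in a small lookup table so that a test script picks the matching fio options for each drive. The Python sketch below is illustrative only. The short series keys and the helper name job_file_lines are assumptions, and the values are exactly those listed above.

```python
# Sketch: the recommended parameters above as a lookup table.
# The dictionary keys and job_file_lines are hypothetical names.

RECOMMENDED = {
    "P3700": {"numjobs": 1, "iodepths": [128]},
    "P4500": {"numjobs": 2, "iodepths": [256]},
    "P4800X": {"numjobs": 1, "iodepths": [8, 16, 32]},
}


def job_file_lines(series):
    """Return one list of fio job-file option lines per iodepth to test."""
    params = RECOMMENDED[series]
    return [
        [f"numjobs={params['numjobs']}", f"iodepth={depth}"]
        for depth in params["iodepths"]
    ]


for series in RECOMMENDED:
    for options in job_file_lines(series):
        print(series, " ".join(options))
```

For the P4800X series the table yields three runs, one per suggested iodepth, which makes it easy to see where IOPS stops growing.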
Why is the write performance unreasonably high?
Drives based on 2D NAND and 3D NAND often greatly exceed their highest specified write values in write/randwrite tests because of the characteristics of the medium: a freshly formatted drive has not yet reached a stable state. Therefore, when testing 2D NAND and 3D NAND drives, it is better to precondition the disk first. The usual practice is to keep writing to the disk after formatting until it is full and its performance is stable. Taking the Intel SSD DC P3700 800GB as an example, it is usually written sequentially with a 4KB block size for two hours and then written randomly for one hour. In addition, the ramp_time fio parameter can be set larger during the test so that an unreasonably high initial value is not counted in the final result.
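The preconditioning sequence described above can be expressed as a two-phase fio job file. The following Python sketch renders such a file and rests on several assumptions: it uses the kernel libaio engine, an iodepth of 32, and a 4k block size for the random phase as well, none of which the text prescribes. The helper name render_precondition_job is hypothetical. The stonewall option makes the second job wait until the first has finished.

```python
# Sketch: render a two-phase preconditioning job file.
# WARNING: running the resulting job overwrites all data on the target device.
# render_precondition_job is a hypothetical helper, not part of fio.

def render_precondition_job(filename, seq_hours=2, rand_hours=1):
    """Return fio job text: sequential fill, then random overwrite."""
    lines = [
        "[global]",
        "ioengine=libaio",   # assumption: precondition through the kernel
        "direct=1",
        "thread=1",
        "bs=4k",             # 4KB for both phases (random phase is assumed)
        "iodepth=32",        # assumption, not specified in the text
        "time_based=1",
        f"filename={filename}",
        "",
        "[seq_fill]",
        "rw=write",
        f"runtime={seq_hours * 3600}",
        "",
        "[rand_fill]",
        "stonewall",         # start only after seq_fill has completed
        "rw=randwrite",
        f"runtime={rand_hours * 3600}",
    ]
    return "\n".join(lines) + "\n"


print(render_precondition_job("/dev/nvme0n1"))
```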
What are significant evaluation indicators for disk performance?
Generally, the performance of a disk is evaluated primarily on three metrics: IOPS, bandwidth, and latency.
IOPS: Evaluations focus mainly on a block size of 4k, for random reads and writes. Therefore, the usual fio key parameters are: bs=4k, iodepth=128, direct=1, rw=randread/randwrite.
Bandwidth: Evaluates the bandwidth of a disk, usually in the case of a block size of 128k under sequential reads and writes. Therefore, the usual fio key parameters are: bs=128k, iodepth=128, direct=1, rw=read/write.
Latency: The assessment of latency usually focuses on the delay between sending an I/O and its completion, so iodepth is generally set to 1. Therefore, the usual fio key parameters are: bs=4k, iodepth=1, direct=1, rw=randread/randwrite. When looking at latency results, pay attention not only to the average value but also to the long tail latency (for example, the 99.99th percentile).
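The three metrics are related by simple arithmetic: bandwidth equals IOPS multiplied by block size, and at steady state with the queue kept full, the mean latency is approximately the queue depth divided by IOPS (Little's law). The Python sketch below works through these relations and shows why the average can hide tail latency. All numbers in it are hypothetical and are not measurements of any drive.

```python
import math

# Sketch: how IOPS, bandwidth, and latency relate. All figures are made up.

def bandwidth_mib_per_s(iops, block_size_bytes):
    """Bandwidth = IOPS x block size, expressed in MiB/s."""
    return iops * block_size_bytes / (1024 * 1024)


def mean_latency_us(iodepth, iops):
    """Little's law: outstanding I/Os = IOPS x mean latency."""
    return iodepth * 1_000_000 / iops


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    rank = min(len(ordered), max(1, rank))
    return ordered[rank - 1]


iops = 500_000  # hypothetical 4k random read result
print("bandwidth MiB/s:", bandwidth_mib_per_s(iops, 4096))
print("mean latency at iodepth=128 (us):", mean_latency_us(128, iops))

# Hypothetical latency samples: most I/Os are fast, a few are very slow.
latencies_us = [10] * 99_980 + [900] * 20
print("average (us):", sum(latencies_us) / len(latencies_us))
print("99.99th percentile (us):", percentile(latencies_us, 99.99))
```

In the sample data the average stays close to the fast case while the 99.99th percentile lands on the slow I/Os, which is why both values are worth reporting.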
Procedures for performance evaluation of SPDK-based NVMe* SSDs are described in this article, including two tools and their three modes. The article also explains the common problems for SPDK performance evaluations. To get more information about SPDK, please visit the SPDK community, spdk.io.