Ceph is an open-source, distributed software platform that provides scalable and reliable object, block, and file storage services. The OSD (object storage daemon) is a core component of Ceph, responsible for storing objects on a local file system, so OSD performance plays an important role in the overall performance of a Ceph cluster.
This article describes some important system-level configurations that may influence cluster performance. For each one, we provide comparative data measured with Ceph and a corresponding analysis. Some common settings, such as CPU turbo, are not covered here.
Automatic NUMA Balancing
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time is determined by the memory proximity to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory.
Linux provides an automatic NUMA balancing setting that can improve the performance of applications running on NUMA hardware. Automatic NUMA balancing moves tasks closer to the memory they access, and it also moves application data to memory closer to the tasks that reference it. A Ceph OSD performs better when its threads access memory on the same NUMA node as the CPUs on which they are scheduled (generally the same node as the disk).
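As a quick way to check which state a host is in before comparing results, the sketch below reads the Linux sysctl file for this setting. It assumes a Linux host that exposes `/proc/sys/kernel/numa_balancing`; the function names are hypothetical helpers introduced only for illustration and are not part of Ceph.

```python
from pathlib import Path

# Linux sysctl file behind kernel.numa_balancing (assumed to exist on the host).
NUMA_BALANCING_PATH = "/proc/sys/kernel/numa_balancing"


def parse_numa_balancing(text):
    """Interpret the sysctl file content.

    Returns True if balancing is on (any non-zero value), False if it is
    off ("0"), and None if the content is not a plain integer.
    """
    value = text.strip()
    if not value.isdigit():
        return None
    return int(value) != 0


def read_numa_balancing(path=NUMA_BALANCING_PATH):
    """Read the setting from disk; None means the file could not be read."""
    try:
        return parse_numa_balancing(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    state = read_numa_balancing()
    if state is None:
        print("numa_balancing: not available on this system")
    else:
        print("numa_balancing:", "enabled" if state else "disabled")
```

Changing the value typically requires root privileges (for example through `sysctl kernel.numa_balancing`), so this sketch only reads it.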
Figure 1 below compares normalized performance with and without numa_balancing.
The 4K random read case is the one most affected by the numa_balancing setting. The metric "metric_NUMA %_Reads addressed to local DRAM" illustrates the difference clearly: with numa_balancing, 71.17% of reads are addressed to local DRAM, while without numa_balancing the value is 34.39%. This indicates that the NUMA balancing setting does affect performance in some cases.
TCMalloc
TCMalloc is Google's customized implementation of the C malloc() function and the C++ operator new, used for memory allocation in C and C++ code. It is a fast, multi-threaded malloc implementation. Ceph uses TCMalloc instead of the libc allocator by default, and TCMalloc has been shown to perform significantly better than the libc allocator.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is an important configuration in TCMalloc. It limits the total number of bytes allocated to thread caches. The default value of 16M may not be sufficient for some applications such as Ceph. Increasing this value may improve performance, at the cost of extra memory use by TCMalloc.
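The variable takes a plain byte count, so sizes such as 16M and 128M have to be converted first. The sketch below does that conversion and builds an environment for a child process. It assumes "M" means MiB (1024 * 1024 bytes) and that the allocator reads the variable from the environment at process start; `thread_cache_bytes` and `env_with_thread_cache` are hypothetical helper names, not part of Ceph or TCMalloc.

```python
import os

ENV_NAME = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"


def thread_cache_bytes(mib):
    """Convert a size in MiB to the byte count the variable expects."""
    if mib <= 0:
        raise ValueError("size must be positive")
    return mib * 1024 * 1024


def env_with_thread_cache(mib, base_env=None):
    """Return a copy of the environment with the thread cache limit set.

    The copy can be passed as env= to subprocess.run() when launching a
    process that is linked against TCMalloc.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_NAME] = str(thread_cache_bytes(mib))
    return env


if __name__ == "__main__":
    # Print the two settings compared in this article.
    for mib in (16, 128):
        print(f"{ENV_NAME}={thread_cache_bytes(mib)}  # {mib} MiB")
```

Because the limit is a total across all thread caches, a larger value trades memory for fewer trips to the central free list, which is why the best setting depends on the workload.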
Figure 2 below compares the default 16M limit with a 128M limit using Ceph. Increasing the value does help in the 4K random write case, which involves a large amount of memory allocation. The most appropriate value needs to be determined case by case.
Maximum Transmission Unit
In computer networking, the maximum transmission unit (MTU) is the size of the largest protocol data unit (PDU) that can be communicated in a single network layer transaction. The MTU relates to the maximum frame size that can be transported on the data link layer.
The standard MTU value is 1500 bytes for historical reasons. Ethernet also has the concept of “jumbo frames”, where the MTU can be set as high as 9000 bytes. Generally, larger MTU values help large IOs. Figure 3 below shows the performance difference between the default MTU value of 1500 and the larger MTU value of 9000 when using Ceph. Setting the MTU to 9000 can improve performance by 16% compared to the default value of 1500 in the 4M sequential write case.
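To give a rough intuition for why large IOs benefit, the sketch below estimates how many packets a 4 MiB transfer needs at each MTU and how much of each frame is payload. It is a simplified model under stated assumptions: TCP over IPv4 with 20-byte headers each and no options, and 38 bytes of per-frame Ethernet overhead (preamble, header, frame check sequence, inter-frame gap). It ignores Ceph's own message framing and CPU costs, so it is not meant to reproduce the 16% measured above.

```python
# Assumed per-packet header sizes: 20-byte IPv4 + 20-byte TCP, no options.
IP_TCP_HEADERS = 40
# Assumed per-frame Ethernet overhead: preamble 8 + header 14 + FCS 4 + gap 12.
ETHERNET_OVERHEAD = 38


def tcp_payload_per_packet(mtu):
    """Application bytes carried by one full-sized packet."""
    if mtu <= IP_TCP_HEADERS:
        raise ValueError("MTU too small for the assumed headers")
    return mtu - IP_TCP_HEADERS


def packets_for_transfer(nbytes, mtu):
    """Number of packets needed to carry nbytes (ceiling division)."""
    payload = tcp_payload_per_packet(mtu)
    return -(-nbytes // payload)


def wire_efficiency(mtu):
    """Fraction of bytes on the wire that are application payload."""
    return tcp_payload_per_packet(mtu) / (mtu + ETHERNET_OVERHEAD)


if __name__ == "__main__":
    io_size = 4 * 1024 * 1024  # one 4 MiB sequential write
    for mtu in (1500, 9000):
        print(
            f"MTU {mtu}: {packets_for_transfer(io_size, mtu)} packets, "
            f"{wire_efficiency(mtu):.2%} payload efficiency"
        )
```

Under these assumptions most of the difference is in the packet count rather than the byte overhead, which suggests that much of the benefit comes from less per-packet processing. Every device on the path has to support the larger MTU for jumbo frames to work.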
Summary
This article describes three important system-level configurations that may affect the performance of distributed storage systems. Automatic NUMA balancing may improve the performance of small IO reads. Raising the TCMalloc thread cache limit may help workloads that perform a large amount of memory allocation. Increasing the MTU may improve the performance of large IOs. The right values for these configurations need to be determined according to the actual situation.
References
- Distributed Storage System (https://cloudian.com/guides/data-backup/distributed-storage/)
- Ceph (https://ceph.io/en/)
- NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access)
- TCMalloc (https://github.com/google/tcmalloc)
- MTU (https://en.wikipedia.org/wiki/Maximum_transmission_unit)