IDC forecasts in their report, Data Age 20251, that by 2025 the global datasphere will grow to 163 zettabytes (ZB), which is a trillion gigabytes. This is ten times the 16.1 ZB of data generated in 2016. The report also describes five key trends that will intensify the role of data in changing our world, one being cognitive/artificial Intelligence (AI) systems. IDC estimates that the amount of analyzed data that is “touched” by cognitive systems will grow by a factor of 100 to 1.4 ZB in 2025.
Apache Hadoop* is a highly scalable system designed to process large amounts of data and is widely used in big data analytics. However, traditional Apache Hadoop nodes share both processing and storage responsibilities, with data stored in the native Apache Hadoop Distributed File Systems (HDFS) on hard drives that are co-located with Apache Hadoop nodes. As a result, both compute and storage resources are bonded to the same physical node and cannot be scaled independently. Expanding the on-premises Apache Hadoop cluster by purchasing more Apache Hadoop nodes eventually leads to over-provisioning and to a waste of either computing or storage spending. Independently scaled compute workloads and storage capacity provide flexibility and significantly reduce the total cost of ownership (TCO).
Meanwhile, as one of the emerging notions – “data gravity”2 – indicates, as the quantity of data grows, more and more applications and services will be “drawn” to where the data is stored. As a result, companies are avoiding moving the data into the cloud after it has been stored; instead, they are building an analytics solution on the data lake.
In this paper, we introduce a new architecture to unlock big data analytics efficiency with disaggregated compute and storage, together with a reference architecture, performance tunings, software optimization, and comparisons with a traditional on-premises deployment solution. We will focus on two typical analytics scenarios: batch analytics and interactive analytics. The former usually run on MapReduce (MR) and Hive, which analyze cold data or not-so-hot data offline, while the latter scenario focuses on analyzing recent data for returning results within minutes with Apache Spark* SQL/Hive SQL. In our architecture, the underlying storage system based on the Ceph*3 object storage system (an open source distributed storage software solution widely adopted in public and private clouds) provides block, object, and file interface in one single platform. S3A is used to connect Apache Hadoop services and the Ceph object storage system; it is a file system client connector used by Apache Hadoop to read and write data from Amazon S3 or a compatible service.
As solid-state drives (SSDs) become more affordable, cloud providers are working to provide high-performance, highly reliable all flash-based storage for their customers. Our reference architecture is based on Intel all-flash array (AFA) systems to showcase both the performance and cost benefit that can delivered by this innovative architecture.
Our study showed that big data analytics on disaggregated object storage provides the same functionality as an on-premises Apache Hadoop cluster and delivers competitive performance. And by leveraging Intel® SSD Data Center P3520 Series NVMe4, a 10-node all-flash cluster provides the same performance as a 60-node HDD cluster, with significant lower TCO.
Apache Hadoop* Cluster Challenges
Scalability
HDFS is designed as a scalable file system to support thousands of nodes in a single cluster. The data nodes scale well in a single namespace; however, the NameNode cannot be scaled horizontally in a single namespace. Often, the performance of jobs across the cluster is limited by the performance of a single NameNode. HDFS federation was introduced to address the namespace scale-out issues. However, the system administrator needs to maintain multiple NameNodes and the load balance service, which will increase the management cost.
Bonded compute and storage architecture
Data volume is growing at an unprecedented rate, often without consummate growth in computation needs. In a traditional Apache Hadoop cluster, the storage and compute resources are bonded together. One common issue for such a setup is where an organization purchases more storage for their on-premises Apache Hadoop cluster and they might purchase computing power that is not necessarily needed. If the data reaches petabyte level, the waste can result in significant capital expenditures. Independently scaled compute and storage can significantly reduce the cost.
Big Data Analytics on Disaggregated Storage
Unified Apache Hadoop* file system and API
Though Apache Hadoop is built on HDFS, it also works with other file systems - blobstores such as Amazon S3 and Microsoft Azure* storage, as well as alternative distributed file systems5. Those file systems are called Apache Hadoop Compatible File System (HCFS). Figure 1 illustrates the popular HCFS and its connectors – including Amazon S3, Google Cloud Platform*, Windows Azure, and Aliyun OSS. Apache Hadoop officially supports storage solutions with vendor-specific adaptors. The S3A connector is one of the most popular connectors as it works for compatible object storage systems. A detailed comparison of those connectors can be found in the Spark Summit talk Spark and Object Stores —What You Need to Know6.
Figure 1. Apache Hadoop* Compatible File System.
Architecture overview
The S3A connector makes running big data analytics workloads on top of object storage system possible. Figure 2 presents an architecture to run big data analytics workloads on top of object storage through HCFS connectors.
Figure 2. Big data analytics on cloud object storage solution overview.
In our work, the S3A connector was chosen due to its popularity. S3A allows you to connect your Apache Hadoop cluster to any S3-compatible object store, which enables a new tiered storage that can be optimized for cost and capacity. The S3A file system uses Amazon's libraries to interact with Amazon S3. It uses the URI prefix s3a://. S3A is the recommended protocol for Apache Hadoop 2.7 and later.
Figure 1 is a reference architecture for big data analytics over disaggregated object storage, but it might have multiple implementations. For example, for the compute side, we could either deploy computation frameworks and applications on bare metal or in a container to run computing jobs by Yarn or Kubernetes, to provide faster deployments, enhanced security, and multitenancy support. For the storage side, we can use an object storage system such as Ceph, Swift*, or Aliyun OSS, even remote HDFS if you do not want to migrate your old data. There are multiple deployment considerations for the object storage, including: co-locating the storage nodes with the gateway, Dynamic DNS or load balancer for the gateway, data protection via storage replication or erasure code, storage tiering, and others. This work will present performance comparisons of several different deployment choices.
Reference cluster topology
To evaluate the functionally and performance of big data analytics over disaggregated storage, the reference cluster shown in Figure 3 was built. The reference cluster consists of 11 nodes, including one head node running DNS server, five compute nodes, and five storage nodes. All of the nodes are equipped with Intel® Xeon® processor E5-2699 v4. As solid-state drives (SSDs) become more affordable, they provide high-performance, low-latency storage solutions based on AFA architecture, and also significantly reduce the operational cost since SSDs have excellent endurance and reliability compared with HDDs. So, for the storage nodes, we used 4x Intel® SSD DC S3510 Series7 and 1x Intel® SSD DC P3700 Series8 for Journal nodes. Figure 3 shows the cluster topology. The detail hardware and software configurations are shown in Tables 1 and 2.
Figure 3. Cluster topology.
Table 1. Apache Hadoop* Cluster Configuration.
CPU | Intel® Xeon® processor E5-2699 v4 @ 2.2 GHz |
Memory | 4x 32 GB DDR4 2666 MHz |
NUC | 1x Intel® 82599 10 Gigabit Ethernet (GbE) Controller |
Storage | Shuffle: 2x Intel® SSD DC S3700 400 GB |
Software Configuration | RHEL* 7.5 Apache Hadoop version 2.7.3/2.8.1 Spark* version 2.1.0/2.2.0 |
Table 2. Ceph* Cluster Configuration.
CPU | Intel® Xeon® processor E5-2699 v4 @ 2.2 GHz |
Memory | 4x 32 GB DDR4 2666 MHz |
NUC | 1~3x Intel® Ethernet Converged Network Adapters X710 10/40 GbE |
Storage | Data: 4x Intel® SSD DC P3510 2.0 TB Journal: 1x Intel® SSD DC P3700 2.0 TB Ceph* Jewel 10.2.10, Filestore as backend, Erasure Coding k=3 m=2 |
Software Configuration | RHEL* 7.5, Ceph* version 10.2.10 |
Various deployments and configurations were evaluated to get optimal results for this solution, including a dedicated radosgw node (Ceph Object Gateway daemon), radosgw collocated with OSD nodes, and dynamic load balancing. The reference cluster shown in Figure 3 is the optimal architecture that provides balanced performance and cost.
Test methodology
To simulate the common usage scenarios in big data applications, we designed two different test cases:
- Simple Read/Write
- Scenario: Simply verify the read/write throughput latency of the storage systems.
- Benchmark tools:
- DFSIO: TestDFSIO is the canonical example of a benchmark that attempts to measure the storage's capacity for reading and writing bulk data.
- TeraSort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.
- Batch Analytics
- Scenario: To consistently execute analytical process to process large set of data.
- Benchmark tools: Leveraging 54 batch queries derived from TPC-DS with intensive reads across objects in different buckets. For testing dataset size, we generated 1 TB and 10 TB of data in the AFA cluster and 100 TB of data in another HDD cluster. For all the tests, page cache was dropped before each run to eliminate performance impact.
Performance overview
Ad-hoc data queries are one important part of big data analytics, and its performance is critical for many Apache Hadoop users. Figure 4 shows the performance of these 54 queries with a 1 TB dataset on different backend storage devices.
Figure 4. Performance comparison of batch analytics on SSD and HDD clusters.
Our results showed significant performance improvement from Apache Hadoop 2.7.3/Spark* 2.1.1 to Apache Hadoop 2.8.1/Spark 2.2.0 (improvement in S3A), and that the overall batch analytics performance of a 10-node Intel® SSD cluster is almost on-par with a 60-node HDD cluster.
Performance comparison with remote HDFS
A big data analytics application running on remote HDFS compares most closely with running on a cloud object store, which also uses decoupled compute and storage. We compared S3 object storage and remote HDFS performance to understand potential performance regressions when running big data analytics workloads on cloud object storage.
Table 3. Remote HDFS Cluster Configuration
CPU | Intel® Xeon® processor E5-2699 v4 @ 2.2 GHz |
Memory | 4x 32 GB DDR4 2666 MHz |
NUC | 1x Intel® 82599 10 Gigabit Ethernet (GbE) Controller |
Storage | Data: 4x Intel® SSD DC S3510 2.0 TB |
Software Configuration | RHEL* 7.5, Hadoop version 2.8.1 |
Figure 5. Performance comparison with remote HDFS on 54 queries.
Figure 5 shows that with tunings applied, performance of big data analytics (concurrent read) on object storage is 17% slower in a best-case scenario than remote HDFS. But for DFSIO tests (large sequential read and write case) the gap is much smaller, and S3A object storage performance is even higher for write. Given the difference of these two implementations on transport efficiency (http vs. socket streaming), this gap is reasonable. However, when running write-intensive workloads like TeraSort, the performance gap might be bigger (see Figure 6).
Figure 6. DFSIO and TeraSort performance comparison of remote HDFS and S3A object storage.
Tunings and Optimization
Running big data workloads on the cloud is more complicated than on a traditional on-premises solution. It has a longer I/O stack because of different software layers: Spark/Hadoop MR/Hive; Yarn and the S3A adapter; and the cloud object storage gateway or proxy and the cloud object storage itself. Software stack tunings and optimization might be critical for performance; in this section, we will present the two most significant performance improvements from our tunings.
S3A HTTP requests issue and optimization
We observed callback failure errors in radosgw logs. Deep investigation showed it was because the S3A adaptor closed the timeout/re-accessed file connection when radosgw was still using this connection. This led to many dead connections in the radosgw process, and caused high CPU utilization and slow read response.
To resolve this issue, the new S3A file system client implemented an experimental feature called “input policies”, available since Apache Hadoop 2.8+ was released9. This configuration is quite similar to the POSIX fadvise() API. We modified the configuration “fs.s3a.experimental.input.fadvise” from “sequential” (default) to “random” to enable connection reuse and thus resolved the issue.
Using a slow HDD as the Ceph backend, there will be many TCP-IP connections between the S3A client and radosgw, which makes the radosgw response very slow. “fs.s3a.readahead.range” is the parameter used to control read ahead bytes during a seek() before closing and re-opening the S3 HTTP connection. Our tunings show by using “1024K” instead of the default “64K” value, the radosgw read performance was greatly improved. Figure 7 shows that this optimization reduced the query time from 8267 seconds to 2659 seconds.
Figure 7. Random read tuning performance comparison
Radosgw deployment optimization
A typical object storage cluster is not an optimal choice for running big data workloads on the cloud, since it requires additional services, including dedicated load balancer services and gateway services. We consolidated the gateway services and the storage services on the same physical nodes, and compared the performance impact. Figure 8 shows the differences of the two architectures.
Figure 8. Radosgw deployment: dedicated RGW and co-located with OSD
Figure 9 demonstrates that with more radosgw instances and round-robin DNS co-located with storage services, batch analytics workload performance was improved by 18%.
Figure 9. Query time comparison between load balance and DNS.
Figure 10 shows the performance of both 1 TB DFSIO while scaling a number of radosgw instances. We observed that 1 TB DFSIO performance stopped increasing after the radosgw number scaled up to 5, and that 10 TB ETL query time stopped decreasing after the radosgw number increased to 9. So, for a real production cluster, the number of gateway instances needs to be tuned to achieve optimal results.
Figure 10. Throughput of the different number of radosgw nodes for DFSIO write.
Spark shuffle optimization
To eliminate additional cost for compute nodes and simplify the deployment configurations, we also compared the performance when deploying a Spark shuffle device on a local disk versus on a Ceph RBD disk. Two RBD volumes were mounted on the compute node as the shuffle devices to replace local physical drives. The Ceph cluster was deployed on an AFA cluster, and based on our previous tests, Ceph AFA RBD volumes delivered a competitive performance (in terms of bandwidth and latency) compared with local volumes. Figure 11 shows that for batch analytics, using RBD as the shuffle device performance is pretty similar to using the local physical device as the shuffle device.
Figure 11. Batch analytics performance comparison on RBD and physical shuffle device
S3A upload optimization
As of Apache Hadoop 2.7, there is a new feature called “fast upload” in the S3A connector, which allows the clients to upload without having to wait for the output stream to close10. Fast upload offers three buffering mechanisms to use: disk, array, and bytebuffer. Figure 12 shows the performance comparison of different buffer mechanisms. It shows array/bytebuffer would improve DFSIO bandwidth significantly, as the temporary data will be buffered in memory. However, if we only have limited memory capacity, using these two parameters may cause memory issues. A careful configuration and tuning is required based on memory capacity and performance needs.
Figure 12. Performance comparison of fast upload buffering mechanism.
Future Work
Many companies are migrating enterprise applications to containers and microservices. This creates a new and compelling trend for cloud application deployment – serverless computing helps developers simplify their programming model, without the need to take care of resource allocation and operations, thus lowering the cost of deployment. With this trend, we’ve observed that it’s becoming increasingly popular to move compute services like Spark into containers or microservices like Kubernetes*11. How to integrate disaggregated storage with this serverless analytics framework remains an interesting topic to be investigated.
Meanwhile, our work also shows there is a potential performance gap for running big data workloads on the cloud compared with previous on-premises deployment. And this performance gap might be even bigger for certain types of workloads. For example, in write-intensive workloads, the performance gap between cloud-based big data with on-premise can’t be ignored. The gap is due to S3A not supporting transactional writes or atomic renames, requiring it to use a combination of operations instead of a simple metadata update to complete the job. An in-memory accelerator might help to close the performance gap. There are several projects like Alluxio*12, Apache Ignite* 13, and Apache Crail*14 that can be deployed as an acceleration layer between compute services and the disaggregated storage solutions. For the rename performance issues, there are some ongoing projects like S3A committer15 that focus on improving the rename performance. Our initial tests with the TeraGen workload showed the write performance is improved by ~2.5x on Ceph S3. The two optimizations could potentially improve the overall solution’s performance and will be our next step.
Summary
The discontinuity in big data infrastructure drives big data analytics on the cloud, which offers benefits including agility, flexibility, simplicity, and scalability. However, big data analytics on cloud performance is much lower compared with on-premises solutions. This work demonstrated big data analytics on the cloud with disaggregated object storage is functionality-ready with tunings and optimization, and the performance is on-par in certain workloads such as read-intensive workloads. Meanwhile, it also proves AFA-based storage could significantly improve big data workloads’ performance and reduce TCO.
However, S3A is not a file system. It lacks support for transaction writes and thus could lead to a significant performance gap for write-intensive workloads. Our work showed that even compared with remote HDFS, running SQL workloads on Ceph object storage cloud performance is 17% lower and the gap is even bigger with TeraSort workloads. We have explored several possible approaches to optimize and improve the performance, including an in-memory acceleration layer and the S3A connector’s new features to further improve the performance.
Part of this work is a joint effort between Intel, Red Hat* and QCT. This works presents more details on the journey to the performance we could achieve today. Our previous work includes “Ceph Data Lake solution for Bigdata”16 on Cephalcon and the Intel and a Red Hat joint talk “Unlock Bigdata Analytic Efficiency with Ceph Data Lake”17 at the Vancouver OpenStack* Summit. Red Hat also published a series of blogs to explain why we should move to Ceph and what the performance looks like in their testing 18.
Reference
- IDC report Data Age 2025, sponsored by Seagate
- Principles of Data Gravity
- Ceph
- Intel® SSD DC P3520 Series
- Filesystem Compatibility with Apache Hadoop
- HCFS Comparison
- Intel® SSD DC S3510 Series
- Intel® SSD DC P3700 Series
- S3A Experimental “fadvise” input policy support
- Stabilizing S3A Fast Upload
- Kubernetes
- Alluxio
- Apache Ignite*
- Apache Crail*
- S3A Committer
- Cephalcon: Unlock Bigdata Analytic Efficiency with Ceph Data Lake, Zhang Jian, Fu Yong.
- Openstack Summit: Unlock bigdata analytic efficiency with Ceph data lake.
- RedHat publication: Why Spark on Ceph? (Part 1 of 3), Brent Compton.