Using TPCx-BB Express Benchmark* for End-to-End Big Data Clusters

The TPCx-BB Express Benchmark* (TPCx-BB) is designed to measure the performance of big data analytics systems. The benchmark contains 30 use cases (queries) that simulate big data processing, big data storage, big data analytics, and reporting. Our team used this benchmark to evaluate the performance of software and hardware components for big data clusters. We found that the benchmark has good coverage of different data types. It also scales well enough to address the challenges of growing data sizes and node counts. We have gained key insights into designing big data analytics systems by using TPCx-BB.

However, we need more than TPCx-BB to evaluate and design complete, end-to-end big data systems, because an analytics system is only one part of a real-world, end-to-end system. For example, the data flow of an end-to-end system should include data ingestion.

Data ingestion moves data from where it originates into a system (such as Apache Hadoop*) where it can be stored and analyzed. Importing that data at a reasonable speed can be challenging for businesses that want to maintain a competitive advantage. However, TPCx-BB was not designed to evaluate the performance of software and hardware for data ingestion. Consider the three dimensions of big data: volume, variety, and velocity. Velocity refers to the high speed of data processing: real time or near real time. Unfortunately, TPCx-BB is strictly limited in how well it can evaluate bandwidth and latency for real-time processing.

This paper discusses our experiences and lessons learned using TPCx-BB to evaluate the performance of software and hardware for real-time processing. We then offer advice on how to extend TPCx-BB to evaluate data ingestion and real-time processing. Finally, we share some ideas on how to implement fuller TPCx-BB coverage for end-to-end big data clusters.