In this paper, we argue that in many “Big Data” applications, getting data into the system correctly and at scale via traditional ETL (Extract, Transform, and Load) processes is a fundamental roadblock to performing timely analytics or making real-time decisions. The best way to address this problem is to build a new architecture for ETL that takes advantage of the push-based nature of a stream processing system. We discuss the requirements for a streaming ETL engine and describe a generic architecture that satisfies those requirements. We also describe our implementation of streaming ETL using a scalable messaging system (Apache Kafka), a transactional stream processing system (S-Store), and a distributed polystore (Intel’s BigDAWG), and we propose a new time-series database optimized to handle ingestion internally.
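To make the push-based ETL idea concrete, the sketch below shows a minimal streaming-ETL loop in Python using the kafka-python client: raw records are consumed from an ingest topic as they arrive, transformed, and immediately published to a cleaned topic for downstream analytics. The broker address, topic names, and transform are hypothetical illustrations only; the paper’s actual pipeline runs the transform step inside S-Store’s transactional dataflow rather than a plain consumer loop.

    # Minimal push-based ETL sketch (hypothetical broker/topics; not the
    # paper's S-Store implementation). Requires: pip install kafka-python
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "sensor-raw",                       # hypothetical ingest topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def transform(record):
        """Example 'T' step: normalize units and reject malformed tuples."""
        if "temp_f" not in record:
            return None                     # drop dirty tuple
        record["temp_c"] = (record["temp_f"] - 32) * 5 / 9
        del record["temp_f"]
        return record

    # Push-based loop: each record is cleaned the moment it arrives, so
    # results reach analytics with low latency instead of waiting for a
    # periodic batch ETL job.
    for msg in consumer:
        clean = transform(msg.value)
        if clean is not None:
            producer.send("sensor-clean", clean)

Unlike a scheduled batch job, this loop blocks on the broker and reacts to every new message, which is the push-based property the proposed architecture builds on.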
Authors
John Meehan
Cansu Aslantas
Stan Zdonik
Jiang Du