This year’s VLDB conference, now in its 47th year, is sure to be inspirational for the database management industry. VLDB 2021 will cover research in data management, databases, and information systems, the technologies considered most important for emerging applications of the 21st century. Intel, a silver sponsor, is pleased to present eight co-authored papers, in addition to presenting lectures and a demonstration on its latest research in data management systems.
In addition, Intel researchers will participate in co-located events including:
- Poly’21 (Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances)
- AIDB 2021 (Applied AI for Database Systems and Applications)
- LADSIOS (Learned Algorithms, Data Structures, and Instance-Optimized Systems)
Two papers co-written by researchers from Intel Labs and the Massachusetts Institute of Technology (MIT) focus on exploiting machine learning techniques to optimize the performance of database systems. The papers are part of co-funded research conducted jointly by Intel and MIT’s Data Systems and AI Lab (DSAIL).
The first paper, “Benchmarking Learned Indexes,” introduces the Search on Sorted Data Benchmark (SOSD), the first public benchmark on comparing learned index structures against traditional ones using real-world datasets. The second paper, “Flow-Loss: Learning Cardinality Estimates That Matter,” introduces Flow-Loss, a novel technique for ML-based cardinality estimation, which is a core task in database query optimization.
Intel will also present a paper and a demonstration on another new public benchmark called Exathlon. Exathlon, developed in collaboration with Ecole Polytechnique in France, introduces the first comprehensive benchmark for explainable anomaly detection over high-dimensional time series data. The benchmark provides open access to a new community resource for supporting reproducible research and experimentation.
Nesime Tatbul, senior research scientist at Intel Labs and MIT, will participate in a roundtable discussion on “Reproducibility and/or Availability” at VLDB. Justin Gottschlich, Principal Artificial Intelligence (AI) Scientist and Director and Founder of Machine Programming Research at Intel Labs, will give the industry keynote address at LADSIOS on “Machine Programming and the Future of Software Development.” Justin will discuss how Intel Labs’ Machine Programming (MP) research team is working toward new ways to automatically develop software and give examples of recently built MP systems that demonstrate state-of-the-art performance.
In addition, Ryan Marcus, Computer Science Postdoctoral Researcher at MIT and Researcher at Intel Labs, will give a tutorial on “Lessons Learned from Three Years of Learned Query Optimization.” Ryan will also present his paper, “Towards Practical Learned Indexing,” at the AIDB Workshop.
Following is a list of all Intel contributions in the VLDB’21 program:
Workshops and Speaking Engagements:
Monday, August 16
11:40 a.m. (CET)
Tutorial: “Lessons Learned from Three Years of Learned Query Optimization”
Ryan Marcus, CS Postdoctoral Researcher at MIT and Researcher, Intel Labs
1:00 p.m. (CET)
Keynote: “Machine Programming and the Future of Software Development”
Justin Gottschlich, Principal AI Scientist and Director and Founder of Machine Programming Research, Intel Labs
Thursday, August 19
10:00-11:00 p.m. (CET)
Roundtable: “Reproducibility and/or Availability”
Nesime Tatbul, Senior Research Scientist at Intel Labs and MIT
AIDB 2021 Workshop
Friday, August 20
Lecture: “Towards Practical Learned Indexing”
Ryan Marcus, Computer Science Postdoctoral Researcher at MIT and Researcher, Intel Labs
- Benchmarking Learned Indexes
Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art “traditional” baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array.
We investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.
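The core idea behind the learned indexes being benchmarked can be sketched in a few lines. The following toy example (an illustration, not code from the paper) fits a least-squares line that maps keys to positions in a sorted array, then bounds the correction search by the model’s worst-case prediction error:

```python
import bisect

class LinearLearnedIndex:
    """Toy learned index: a least-squares line predicts a key's position
    in a sorted array; a bounded binary search corrects the prediction."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~= slope * key + intercept by least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        self.slope = sum((k - mean_k) * (i - mean_p)
                         for i, k in enumerate(self.keys)) / var
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst-case prediction error to bound lookups.
        self.err = max(abs(self._predict(k) - i)
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1
```

Real learned indexes (e.g., RMI or PGM-style structures in the benchmark) use hierarchies or piecewise models rather than one global line, but the predict-then-correct pattern is the same.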
- Flow-Loss: Learning Cardinality Estimates That Matter
Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer’s cost model and dynamic programming search algorithm with analytical functions.
At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB), which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error.
When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5X better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data.
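Q-Error, the standard accuracy metric that the Flow-Loss model trades off against, is the multiplicative estimation error. A minimal sketch (not from the paper) of the metric, illustrating why two estimates with identical Q-Error can still matter very differently to an optimizer’s plan choice:

```python
def q_error(estimate, truth):
    """Standard Q-error: multiplicative error max(est/true, true/est).
    A perfect estimate scores 1.0; over- and underestimation of the
    same factor score identically."""
    estimate, truth = max(estimate, 1.0), max(truth, 1.0)
    return max(estimate / truth, truth / estimate)

# Two hypothetical sub-plan estimates with the SAME Q-error (10x):
# one overestimates, one underestimates. Flow-Loss's premise is that
# the optimizer may react very differently to each, so average
# Q-error alone does not predict plan quality.
assert q_error(1_000, 100) == q_error(10, 100) == 10.0
```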
- Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series
Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining significant attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster.
Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval and for the extended effect interval are provided. The anomaly instances support the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon’s dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.
- ParaX: Boosting Deep Learning for Big Data Analytics on Many-Core CPUs
Although GPUs and accelerators are more efficient in deep learning (DL), commercial clouds now heavily use CPUs in DL computation because large numbers of CPUs would otherwise sit idle during off-peak periods. Following this trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient in DL computation.
We analyze the memory access patterns of various layers and identify the root cause of the low scalability: the per-layer barriers implicitly imposed by current platforms, which assign a single instance (i.e., one batch of input data) to the entire CPU. These barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers. This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation.
Our key idea is to assign one instance to each CPU core instead of to the entire CPU, removing the per-layer barriers across cores. ParaX designs an ultra-light scheduling policy, which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention. It proposes a NUMA-aware gradient server mechanism for training, which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. Extensive evaluation shows that ParaX achieves significant acceleration for DL on many-core CPUs.
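The per-core assignment idea can be sketched with standard Python multiprocessing (a simplification, not the ParaX implementation): each worker process runs the entire layer stack on its own instance end to end, so no synchronization is needed between cores at layer boundaries.

```python
from multiprocessing import Pool

def double_layer(x):
    return 2.0 * x

def relu_layer(x):
    return max(x, 0.0)

# Toy "layer stack"; real layers would be matrix operations.
LAYERS = [double_layer, relu_layer]

def forward(instance):
    """Run the whole stack on one instance end to end. Because each
    worker owns its instance, there is no per-layer barrier between
    cores (the core idea behind ParaX's per-core assignment)."""
    x = instance
    for layer in LAYERS:
        x = layer(x)
    return x

if __name__ == "__main__":
    instances = [float(i) for i in range(8)]  # one instance per core
    with Pool(processes=4) as pool:
        outputs = pool.map(forward, instances)
```

In the barrier-based scheme ParaX replaces, all cores would instead cooperate on a single batch and synchronize after every layer, which is where the contention the abstract describes arises.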
- Declarative Data Serving: The Future of Machine Learning Inference on the Edge
Recent advances in computer architecture and networking have ushered in a new age of edge computing, where computation is placed close to the point of data collection to facilitate low-latency decision making. As the complexity of such deployments grows into networks of interconnected edge devices, getting the necessary data to be in “the right place at the right time” can become a challenge. We envision a future of edge analytics where data flows between edge nodes are declaratively configured through high-level constraints.
Using machine-learning model-serving as a prototypical task, we illustrate how the heterogeneity and specialization of edge devices can lead to complex, task-specific communication patterns even in relatively simple situations. Without a declarative framework, managing this complexity will be challenging for developers and will lead to brittle systems. We conclude with a research vision for the database community that brings our perspective to the emergent area of edge computing.
- Optimizing In-memory Database Engine For AI-powered On-line Decision Augmentation Using Persistent Memory
Online decision augmentation (OLDA) has been considered a promising paradigm for real-time decision making powered by Artificial Intelligence (AI). OLDA has been widely used in many applications such as real-time fraud detection, personalized recommendation, etc. Online inference feeds real-time features extracted from multiple time windows into a pre-trained model to evaluate new data and support decision making. Feature extraction is usually the most time-consuming operation in many OLDA data pipelines.
In this work, we started by studying how existing in-memory databases can be leveraged to efficiently support such real-time feature extraction. However, we found that existing in-memory databases take hundreds or even thousands of milliseconds, which is unacceptable for OLDA applications with strict real-time constraints. We therefore propose Feature Engineering Database (FEDB), a distributed in-memory database system designed to efficiently support online feature extraction.
Our experimental results show that FEDB can be one to two orders of magnitude faster than state-of-the-art in-memory databases on real-time feature extraction. Furthermore, we explore the use of the Intel Optane DC Persistent Memory Module (PMem) to make FEDB more cost-effective. Compared to FEDB using DRAM+SSD, the proposed PMem-optimized persistent skip list can shorten tail latency by up to 19.7%, reduce recovery time by up to 99.7%, and save up to 58.4% of the total cost of a real OLDA pipeline.
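FEDB’s own interface is not shown here, but the kind of multi-window feature extraction the abstract describes can be sketched in plain Python (all names hypothetical): given one key’s time-ordered events, compute aggregates over several trailing time windows in a single pass per window.

```python
from bisect import bisect_left

def window_features(events, now, windows=(10, 60)):
    """Toy version of an OLDA feature-extraction step: for one key's
    time-ordered (timestamp, amount) events, compute count and sum
    over each trailing time window ending at `now`."""
    times = [t for t, _ in events]
    feats = {}
    for w in windows:
        # First event inside the trailing window [now - w, now].
        start = bisect_left(times, now - w)
        amounts = [a for _, a in events[start:]]
        feats[f"count_{w}s"] = len(amounts)
        feats[f"sum_{w}s"] = sum(amounts)
    return feats

# Hypothetical transactions for one account (timestamp, amount):
txns = [(1, 5.0), (50, 20.0), (55, 7.5), (58, 2.5)]
feats = window_features(txns, now=60)
# count_10s/sum_10s cover events with timestamp >= 50
```

A real-time fraud model would be fed such features per request, which is why the abstract treats feature extraction latency as the critical cost.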
- Using VDMS to Index and Search 100M Images
Data scientists spend most of their time on data preparation, rather than doing what they know best: building machine learning models and algorithms to solve previously unsolvable problems. In this paper, we describe the Visual Data Management System (VDMS) and demonstrate how it can be used to simplify the data preparation process and consequently gain efficiency, simply because we are using a system designed for the job.
To demonstrate this, we use one of the largest available public datasets (YFCC100M), with 100 million images and videos, plus additional data including machine-generated tags, for a total of ∼12 TB of data. VDMS differs from existing data management systems due to its focus on supporting machine learning and data analytics pipelines that rely on images, videos, and feature vectors, treating these as first-class citizens.
We demonstrate how VDMS outperforms well-known and widely used systems for data management by up to ∼364x for an image search engine implementation, with an average improvement of ∼85x across our use cases, particularly at scale. At the same time, VDMS simplifies the process of data preparation and data access, and provides functionality that is non-existent in alternative options.
- A Demonstration of the Exathlon Benchmarking Platform for Explainable Anomaly Detection
In this demo, Intel introduces Exathlon – a new benchmarking platform for explainable anomaly detection over high-dimensional time series. Intel designed Exathlon to support data scientists and researchers in developing and evaluating learned models and algorithms for detecting anomalous patterns as well as discovering their explanations. This demo will showcase Exathlon’s curated anomaly dataset, novel benchmarking methodology, and end-to-end data science pipeline in action via example usage scenarios.