Intel Labs Together with MIT Presents Award-Winning Research on Data Systems and Machine Learning at ACM SIGMOD/PODS 2021

Highlights

  • The ACM SIGMOD/PODS 2021 conference (the ACM SIGMOD International Conference on Management of Data and the Symposium on Principles of Database Systems) runs from June 20-25.

  • Intel Labs and MIT win the SIGMOD 2021 Best Paper Award for research on machine learning-based data management.

As a long-time sponsor, attendee, and contributor to the SIGMOD/PODS conference, the industry-leading gathering on large-scale data management, Intel Labs will present its latest research on data systems and machine learning (ML). This year’s conference will be held virtually from June 20-25, with a local event in Xi’an, Shaanxi, China. As part of the event, Intel will present several research contributions, including collaborations with the Massachusetts Institute of Technology (MIT) and Microsoft, and will co-chair a workshop on exploiting artificial intelligence (AI) techniques for data management.

Intel Labs is proud to present its award-winning research resulting from joint work conducted through the MIT Data Systems and AI Lab (DSAIL) with partners Microsoft and Google. The lab was established to explore and create AI-based solutions for large-scale data systems and enterprise applications. This year’s research is the culmination of three years of work at DSAIL and includes a real-world application of machine learning techniques for database query optimization.

The joint paper, “Bao: Making Learned Query Optimization Practical,” won the SIGMOD 2021 Best Paper Award in the Data Management track and demonstrates how Bao (BAndit Optimizer) can quickly learn strategies that improve end-to-end query execution performance. DSAIL researchers also received the SIGMOD 2021 Honorable Mention Award in the Industry track together with Microsoft for the paper, “Steering Query Optimizers: A Practical Take on Big Data Workloads,” which illustrates one of the first real-world applications of Bao, based on big data workloads at Microsoft.

In conjunction with this research, DSAIL also plans to share regular research updates on the MIT/Intel collaboration in a series of blogs. The first blog, “Announcing the Learned Indexing Leaderboard,” is followed by a post on “More Bao Results: Learned Distributed Query Optimization on Vertica, Redshift, and Azure Synapse.” A third post is planned to provide a general overview of ML for systems. Check this site for regular updates.

In addition, Intel’s work with Georgia Tech and Carnegie Mellon University is detailed in the paper, “Spitfire: A Three-Tier Buffer Manager for Volatile and Non-Volatile Memory.” The paper is one of the first to discuss database buffer management on Intel® Optane™ persistent memory and the Intel® Optane™ Solid State Drive (SSD). It also shows how machine learning techniques can be used to automatically tune the policies for data migration across the storage hierarchy based on the requirements of the workload.

Intel will also be involved in two workshops during the conference. In the first, Intel will present “Resource-Efficient Database Query Processing on FPGAs,” a joint paper with Dresden University of Technology and SAP, during the 17th International Workshop on Data Management on New Hardware (DaMoN 2021) on Monday, June 21. Intel will also co-chair the 4th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM) on Friday, June 25.

The following is an overview of the papers featured at the conference.


Research Papers
 

Bao: Making Learned Query Optimization Practical

Recent efforts applying machine learning techniques to query optimization have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties, we introduce Bao (the Bandit optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints.

Bao combines modern tree convolutional neural networks with Thompson sampling, a well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly learn strategies that improve end-to-end query execution performance, including tail latency, for several workloads containing long-running queries. In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a commercial system.
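
For readers new to the underlying technique, here is a minimal Python sketch of Thompson sampling over a handful of per-query hint sets. It is a context-free toy: the hint names, the running Gaussian reward estimate, and the run_query placeholder are assumptions made for illustration, while Bao itself conditions its choices on the query plan through a tree convolutional neural network. Read it as a sketch of the bandit idea, not as Bao's implementation.

    # A toy illustration of Thompson sampling over per-query hint sets, loosely
    # in the spirit of Bao. The hint sets, the Gaussian reward model, and
    # run_query() are illustrative assumptions; Bao's real model is a tree
    # convolutional neural network over query plans.
    import random

    HINT_SETS = ["no_hints", "disable_nested_loop_join", "disable_merge_join"]

    # Per-hint-set belief about expected reward (e.g., negative latency),
    # tracked with a simple running mean/variance.
    stats = {h: {"n": 1, "mean": 0.0, "m2": 1.0} for h in HINT_SETS}

    def sample_belief(s):
        # Draw one plausible value of this arm's expected reward.
        variance = s["m2"] / s["n"]
        return random.gauss(s["mean"], (variance / s["n"]) ** 0.5)

    def update(s, reward):
        # Welford-style online update of the running mean and variance.
        s["n"] += 1
        delta = reward - s["mean"]
        s["mean"] += delta / s["n"]
        s["m2"] += delta * (reward - s["mean"])

    def run_query(sql, hint_set):
        # Placeholder: a real system would execute the query with the chosen
        # hints and report a reward such as negative latency.
        return -random.uniform(0.5, 2.0)

    def optimize(sql):
        # Thompson sampling: act on a sample from each belief, then learn
        # from the observed outcome, so mistakes correct themselves over time.
        choice = max(HINT_SETS, key=lambda h: sample_belief(stats[h]))
        update(stats[choice], run_query(sql, choice))
        return choice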
 

Spitfire: A Three-Tier Buffer Manager for Volatile and Non-Volatile Memory

The design of the buffer manager in database management systems (DBMSs) is influenced by the performance characteristics of volatile memory (e.g., DRAM) and non-volatile storage (e.g., SSD). The key design assumptions have been that the data must be migrated to DRAM for the DBMS to operate on it and that storage is orders of magnitude slower than DRAM. But the arrival of new non-volatile memory (NVM) technologies (e.g., Intel Optane persistent memory) that are nearly as fast as DRAM invalidates these previous assumptions.

Researchers have recently designed Hymem, a novel buffer manager for a three-tier storage hierarchy comprising DRAM, NVM, and SSD. Hymem supports cache-line-grained loading and an NVM-aware data migration policy. While these optimizations improve its throughput, Hymem suffers from two limitations. First, it is a single-threaded buffer manager. Second, it is evaluated on an NVM emulation platform. These limitations constrain the utility of the insights obtained using Hymem. In this paper, we present Spitfire, a multi-threaded, three-tier buffer manager that is evaluated on real NVM hardware. We introduce a general framework for reasoning about data migration in a multi-tier storage hierarchy.

We illustrate the limitations of the optimizations used in Hymem on Intel Optane technology and then discuss how Spitfire circumvents them. We demonstrate that the data migration policy has to be tailored based on the characteristics of the devices and the workload. Given this, we present a machine learning technique for automatically adapting the policy for an arbitrary workload and storage hierarchy. Our experiments show that Spitfire works well across different workloads and storage hierarchies.
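
To make the idea of a tunable migration policy concrete, the hypothetical sketch below treats admission into DRAM and NVM as probabilities that a learned tuner could set per workload and storage hierarchy. The tier costs, parameter names, and admission rule are assumptions for illustration; they are not Spitfire's actual policy or code.

    # A hypothetical sketch of a probabilistic, multi-tier migration policy.
    # The tiers, parameters, and admission rule are illustrative assumptions,
    # not Spitfire's implementation.
    import random
    from dataclasses import dataclass

    @dataclass
    class MigrationPolicy:
        # Probability of promoting a page into a faster tier after a read.
        dram_read_admit: float = 0.5
        nvm_read_admit: float = 0.5

        def on_read(self, page, tiers):
            # Serve the page from the fastest tier that holds it, and
            # probabilistically promote it for future reads (SSD -> NVM -> DRAM).
            if page in tiers["DRAM"]:
                return "DRAM"
            if page in tiers["NVM"]:
                if random.random() < self.dram_read_admit:
                    tiers["DRAM"].add(page)      # promote for future reads
                return "NVM"                     # this read is served from NVM
            if random.random() < self.nvm_read_admit:
                tiers["NVM"].add(page)
            return "SSD"

    def evaluate(policy, trace, tiers):
        # Score a policy on a trace of page reads; lower total cost is better.
        cost = {"DRAM": 1, "NVM": 3, "SSD": 100}
        return sum(cost[policy.on_read(p, tiers)] for p in trace)

    # A learned tuner could pick the admission probabilities per workload,
    # e.g., by scoring candidate settings on a sampled trace.
    print(evaluate(MigrationPolicy(), ["p1", "p2", "p1", "p1"], {"DRAM": set(), "NVM": set()}))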


Industry Papers
 

Steering Query Optimizers: A Practical Take on Big Data Workloads

In recent years, there has been tremendous interest in research that applies machine learning to database systems. As one of the most complex components of a DBMS, the query optimizer could benefit from adaptive policies that are learned systematically from the data and the query workload. Recent research has brought up novel ideas towards a learned query optimizer; however, these ideas have not been evaluated on a commercial query processor or on large-scale, real-world workloads. In this paper, we take the approach used by Marcus et al. in Bao and adapt it to SCOPE, a big data system used internally at Microsoft.

Along the way, we solve multiple new challenges: we define how optimizer rules affect final query plans by introducing the concept of a rule signature, we devise a pipeline computing interesting rule configurations for recurring jobs, and we define a new learning problem allowing us to apply such interesting rule configurations to previously unseen jobs. We evaluate the efficacy of the approach on production workloads that include 150K daily jobs. Our results show that alternative rule configurations can generate plans with lower costs, and this can translate to runtime latency savings of 7-30% on average and up to 90% for a non-trivial subset of the workload.
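
As a rough illustration of what a rule signature could look like, the sketch below fingerprints the set of optimizer rules that shaped a job's final plan and reuses the best-known rule configuration for unseen jobs with the same fingerprint. The rule names, hashing scheme, and matching logic are hypothetical and are not SCOPE internals.

    # A hypothetical illustration of a "rule signature": a compact fingerprint
    # of which optimizer rules actually influenced a job's final plan. The rule
    # names, jobs, and matching logic are made up and are not SCOPE internals.
    import hashlib

    def rule_signature(rules_affecting_plan):
        # Order-independent fingerprint of the rules that shaped the final plan.
        canonical = ",".join(sorted(rules_affecting_plan))
        return hashlib.sha1(canonical.encode()).hexdigest()[:12]

    # Offline pipeline (sketch): for recurring jobs, remember which alternative
    # rule configuration produced the cheapest plan for each signature.
    best_config_by_signature = {
        rule_signature({"PushDownFilter", "EnableBroadcastJoin"}): {"EnableBroadcastJoin": False},
    }

    def configuration_for(job_rules):
        # A previously unseen job reuses the configuration learned for jobs
        # whose final plans were shaped by the same set of rules.
        return best_config_by_signature.get(rule_signature(job_rules), {})  # {} = default optimizer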


Workshop Papers
 

Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. 

We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user’s database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space.
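
The sketch below shows one way a learned encoding advisor could be structured: simple statistics computed from a numeric column sample feed a classifier that predicts an encoding. The candidate encodings, features, toy training data, and the use of scikit-learn's decision tree are assumptions for illustration, not LEA's actual design or training procedure.

    # A minimal sketch of learned column-encoding selection in the spirit of
    # LEA. The candidate encodings, features, toy training data, and use of a
    # scikit-learn decision tree are assumptions, not LEA's actual design.
    from sklearn.tree import DecisionTreeClassifier

    def column_features(values):
        # Cheap statistics computed from a numeric sample of the column.
        n = len(values)
        distinct_ratio = len(set(values)) / n
        run_ratio = (1 + sum(1 for a, b in zip(values, values[1:]) if a != b)) / n
        return [distinct_ratio, run_ratio, sum(values) / n]

    # Training data would come from synthetic columns whose best encoding was
    # measured on the target system (by encoded size and/or query latency).
    train_X = [[0.01, 0.02, 50.0], [0.9, 0.95, 50.0], [0.5, 0.99, 1000.0]]
    train_y = ["RLE", "PLAIN", "DELTA"]
    advisor = DecisionTreeClassifier().fit(train_X, train_y)

    def recommend_encoding(sample_values):
        return advisor.predict([column_features(sample_values)])[0]

    print(recommend_encoding([7] * 900 + [8] * 100))   # long runs -> likely "RLE"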
 

Resource-Efficient Database Query Processing on FPGAs

FPGA technology has introduced new ways to accelerate database query processing that often result in higher performance and energy efficiency. This is thanks to the unique architecture of FPGAs, whose reconfigurable resources can be programmed to behave like an application-specific integrated circuit. The limited amount of these resources restricts the number and type of modules that an FPGA can simultaneously support. In this paper, we propose “morphing sort-merge”: a set of run-time configurable FPGA modules that achieves resource efficiency by reusing the FPGA’s resources to support different pipeline-breaking database operators, namely sort, aggregation, and equi-join. The proposed modules use dynamic optimization mechanisms that adapt the implementation to the distribution of the data at run-time, thus resulting in higher performance. Our benchmarks show that morphing sort-merge reaches an average speedup of 5× compared to MonetDB.
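
As a software analogy of the “morphing” idea (the paper’s contribution is an FPGA design, not code like this), the sketch below reuses one merge-based core, selected by a run-time mode, to implement sort, aggregation, and equi-join. All operator details here are simplified assumptions made for illustration.

    # A software analogy of the "morphing" idea, not the FPGA design itself:
    # a single merge-based core, selected by a run-time mode, backs the three
    # pipeline-breaking operators named in the paper (sort, aggregation,
    # equi-join). Operator details are simplified assumptions.
    from heapq import merge
    from itertools import groupby
    from operator import itemgetter

    def merge_core(sorted_runs):
        # The shared "resource": a k-way merge of key-sorted runs of (key, value) rows.
        return list(merge(*sorted_runs, key=itemgetter(0)))

    def morph(mode, *inputs):
        # Run-time configuration decides how the shared core is used.
        if mode == "sort":
            rows, run_size = inputs[0], 4
            runs = [sorted(rows[i:i + run_size]) for i in range(0, len(rows), run_size)]
            return merge_core(runs)
        if mode == "aggregate":  # SUM per key over the core's sorted output
            merged = morph("sort", inputs[0])
            return [(k, sum(v for _, v in grp)) for k, grp in groupby(merged, key=itemgetter(0))]
        if mode == "equi_join":  # zipper two sorted inputs (one match per key, for brevity)
            left, right = morph("sort", inputs[0]), morph("sort", inputs[1])
            out, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                if left[i][0] == right[j][0]:
                    out.append((left[i][0], left[i][1], right[j][1]))
                    i, j = i + 1, j + 1
                elif left[i][0] < right[j][0]:
                    i += 1
                else:
                    j += 1
            return out
        raise ValueError(mode)

    print(morph("aggregate", [("a", 1), ("b", 2), ("a", 3)]))  # [('a', 4), ('b', 2)]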