Expand your Training Limits! Generating Training Data for ML-based Data Management

Published Apr 7, 2021

Demo of DataFarm, an innovative framework for efficiently generating and labeling large query workloads

DataFarm is the result of the following work:

"Expand your Training Limits! Generating Training Data for ML-based Data Management", by Francesco Ventura (Politecnico di Torino*), Zoi Kaoudi (TU Berlin / DFKI GmbH), Jorge-Arnulfo Quiané-Ruiz (TU Berlin / DFKI GmbH), and Volker Markl (TU Berlin / DFKI GmbH). Accepted for presentation at the 2021 ACM SIGMOD/PODS International Conference on Management of Data, Xi'an, Shaanxi, China.

*Work done while interning at TU Berlin.

Abstract: Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) widely limits further advancement in research and compromises the technology transfer from research to industry. Collecting a labeled query workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic queries/jobs. In this work, we face the problem of generating training data for data management components tailored to users' needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state-of-the-art both in query generation and label estimation using synthetic and real datasets. It has up to 9× better labeling performance, in terms of R² score. More importantly, it allows users to reduce the cost of getting labeled query workloads by 54× (and up to an estimated factor of 10⁴×) compared to standard approaches.

Download the preprint: https://www.redaktion.tu-berlin.de/fileadmin/fg131/Publikation/Papers/Ventura_Expand-your_training-limits_SIGMOD-2021_preprint.pdf

Database Systems and Information Management (DIMA) group, Technische Universität Berlin
https://www.dima.tu-berlin.de/menue/database_systems_and_information_management_group/

Intelligent Analytics for Massive Data (IAM) group, German Research Center for Artificial Intelligence (DFKI)
https://www.dfki.de/en/web/research/research-departments/intelligent-analytics-for-massive-data/

Politecnico di Torino
https://www.polito.it/

__________________________________
https://bifold.berlin/ @bifoldberlin
https://twitter.com/bifoldberlin