Modern Data Analytics 2020
Organization: Prof. Dr. Dan Olteanu, Prof. Dr. Michael Böhlen
This seminar gives an overview of recent research developments at the intersection of databases and machine learning. In particular, it considers two distinct lines of work:
- The application of machine learning to databases: Use models to predict query performance or to replace traditional modules of a database management system, such as indices.
- The application of databases to machine learning: Use database techniques to improve the runtime performance of training machine learning models.
Learning outcome: The goal of the seminar is to expose the students to the recent trends in academia and industry on rethinking database management systems and on how to effectively unify knowledge on both machine learning and databases to scale data science workloads.
Target audience: MSc in Data Science students (participation is limited to 18 students)
Semester: This seminar will be offered in Fall 2020.
Teaching format: Each participant writes a self-contained report of about 10 pages, gives a 30-minute presentation, and answers follow-up technical questions in a 15-minute Q&A session. Each participant has a buddy. Buddies read the report, make suggestions for improvements, and help with the presentation (e.g., dry runs). The first version of the report is due three weeks before the date of the presentation. This first version of the report and the presentation will be discussed with the buddy and the teacher about two weeks before the presentation. The final version of the report is due one week before the presentation.
Registration: Please register using the following form. This form will send your data (name, major, choice of three research papers from the list below) to the seminar organizers. You will receive confirmation of your participation and your assigned research paper by email, and the list below will be updated to highlight the papers that remain available. There are still open places!
Meetings: The first meeting will be on Monday, September 14, 2020, from 17:00 until 18:30 in room BIN 2.A.10. In this meeting, the organizers will give a presentation overviewing the topics to be investigated in the seminar and answer questions from the participants. The slides of the first meeting can be found here (PDF, 183 KB).
The student presentations will take place on Saturday, December 5, and Saturday, December 12, 2020, in room BIN 2.A.01.
Participation in all three meetings is compulsory. The assessment depends on the quality of the report, the presentation, active participation during the seminar, and input as a buddy.
How to read papers and give talks
How to read papers:
- Focus questions to help identify the main contributions of a paper
- Survival kit: includes tips on how to read technical sections and the "three-pass approach" to tie it all together
- Reading Research Papers by Andrew Ng
How to give talks:
- These two articles have a number of good suggestions.
- This video is pretty good as well.
- How To Speak by Patrick Winston - a newer version of Patrick's talk
Papers to be read by all students
- Kick-off Meeting Slides (PDF, 183 KB)
- MLSys Whitepaper
- A Few Useful Things to Know About Machine Learning
The following are individual paper assignments organized by topic. When an entry lists two papers, either both papers are presented together (as they use similar ideas) or only one of them is presented.
Topic 1: System Perspective on Machine Learning Life-Cycle
- 1.1 Hidden Technical Debt in Machine Learning Systems
- 1.2 On Challenges in Machine Learning Model Management
- 1.3 Accelerating the Machine Learning Lifecycle with MLflow
- 1.4 TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
- 1.5 MODELDB: Opportunities and Challenges in Managing Machine Learning Models,
MODELDB: A System for Machine Learning Model Management
Topic 2: Learning Data Structures used in Database Systems
- 2.1 The Case for Learned Index Structures
- 2.2 Learning Multidimensional Indexes,
LISA: A Learned Index Structure for Spatial Data
- 2.3 ALEX: An Updatable Adaptive Learned Index
- 2.4 Learning Data Structure Alchemy,
The Periodic Table of Data Structures
- 2.5 Learned Cardinalities: Estimating Correlated Joins with Deep Learning
Topic 3: In-database Machine Learning and Linear Algebra
- 3.1 Materialization Optimizations for Feature Selection Workloads
- 3.2 The MADlib Analytics Library or MAD Skills, the SQL
- 3.3 Learning Linear Regression Models over Factorized Joins
- 3.4 A Layered Aggregate Engine for Analytics Workloads
- 3.5 Rk-means: Fast Clustering for Relational Data
- 3.6 LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation
Paper Assignments, Buddies, and Supervisors
Paper | Name | Buddy | Supervisor |
---|---|---|---|
1.1 | Yang Menz | Maximilian Tornow | Dan Olteanu |
1.2 | Joel Leupp | Xiaozhe Yao | Dan Olteanu |
1.3 | Padmapriya Raghu | Aditi Thakur | Dan Olteanu |
1.4 | Aditi Thakur | Padmapriya Raghu | Dan Olteanu |
1.5 | Xiaozhe Yao | Joel Leupp | Ahmet Kara |
2.1 | Clive Charles Javara | Syed Shahvaiz Ahmed | Ahmet Kara |
2.2 | Dominique Hässig | Ankush Panwar | Ahmet Kara |
2.3 | Syed Shahvaiz Ahmed | Clive Charles Javara | Ahmet Kara |
2.4 | Maximilian Tornow | Yang Menz | Michael Böhlen |
2.5 | Ankush Panwar | Dominique Hässig | Michael Böhlen |
3.1 | Ming Yi | YingYing Chen | Michael Böhlen |
3.2 | Ratanak Hy | Nazim Bayram | Michael Böhlen |
3.3 | Yu Linghu | Christoph Mayer | Nils Vortmeier |
3.4 | YingYing Chen | Ming Yi | Nils Vortmeier |
3.5 | Christoph Mayer | Yu Linghu | Nils Vortmeier |
3.6 | Nazim Bayram | Ratanak Hy | Nils Vortmeier |
Papers 1.1 to 2.3 will be presented on Saturday, December 5, 2020
Papers 2.4 to 3.6 will be presented on Saturday, December 12, 2020