Modern Data Analytics 2025
Organization: Prof. Dr. Dan Olteanu, Dr. Andrei Draghici, Dr. Haozhe Zhang, Christoph Mayer, Eden Chmielewski, and Yuchen He.
This seminar provides a deep dive into the recent research developments reshaping the core of modern database systems: query processing and optimization. The performance of virtually every data-driven application hinges on the database's ability to translate declarative queries into efficient, low-level execution plans. However, the sheer complexity of modern analytics, the demand for real-time results, and the scale of today's datasets are pushing classical, heuristic-based optimizers to their breaking point.
Learning outcome: The goal of the seminar is to expose the students to the recent trends in academia and industry on rethinking modern data analytics systems. The students will read and present research published in the top international venues in data management research, in particular ACM Special Interest Group on Management of Data (SIGMOD) and Very Large Data Bases (VLDB). Students will gain a deep understanding of the challenges and state-of-the-art solutions in query optimization, robust execution, and real-time analytics maintenance. The course will equip them to critically analyze and contribute to the development of next-generation, high-performance data systems.
Target audience: MSc students in Software Engineering, Data Science, and AI.
Semester: This seminar will be offered in Fall 2025.
Teaching format: Each participant prepares a presentation based on a research paper; answers follow-up technical questions; reads the other papers in the seminar session; and actively participates in the technical discussions in the seminar. Each participant has a buddy, who helps improve the presentation by suggesting changes and attending dry runs. The best presentation of the seminar will be selected by the participants and receive a prize.
Registration: Please register as required by the department. In addition, please browse the papers mentioned below. In the kickoff meeting, the papers will be assigned to students, so make sure you get assigned to a paper you want.
Meetings: The first meeting will be on Thursday, September 18, 2025, from 10:15 to 12:00 in room BIN 1.D.29. In this meeting, the organizers will give an overview of the topics to be investigated in the seminar and answer questions from the participants. Students will also be assigned to papers in this session.
The student presentations will take place on Saturday, November 8, and Saturday, November 22, 2025, in BIN 2.A.01.
Participation in all three meetings is compulsory. The assessment depends on the quality of the presentation, active participation during the seminar, and input as a buddy.
How to read papers and give talks
How to read papers:
- Focus questions to help identify the main contributions of a paper
- Survival kit: includes tips on how to read technical sections and the "three-pass approach" to tie it all together
- Reading Research Papers by Andrew Ng
How to give talks:
- These two articles have a number of good suggestions.
- This video is pretty good as well.
- How To Speak by Patrick Winston - a newer version of Patrick's talk
Slides from the kick-off meeting
Here are the slides from the kick-off meeting: Introduction slides. Below you can find the assignments for the two presentation days. If you want to get in contact with your supervisor, here is our list of emails:
Presentations for November 8th
- How Good are Query Optimizers, Really? and Still Asking: How Good Are Query Optimizers, Really?
- Presented by: Müge Yegin
- Buddy: Xinyao Cao
- Supervisor: Christoph Mayer
- SQLStorm: Taking Database Benchmarking into the LLM Era
- Presented by: Xinyao Cao
- Buddy: Noah Croes
- Supervisor: Dan Olteanu
- How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks
- Presented by: Noah Croes
- Buddy: Müge Yegin
- Supervisor: Andrei Draghici
- SafeBound: A Practical System for Generating Cardinality Bounds
- Presented by: Michael Sigg
- Buddy: Sofoklis Strompolas
- Supervisor: Yuchen He
- Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server
- Presented by: Sofoklis Strompolas
- Buddy: Michael Sigg
- Supervisor: Eden Chmielewski
- DPconv: Super-Polynomially Faster Join Ordering
- Presented by: Lihui Zhou
- Buddy: Annamaria Vass
- Supervisor: Yuchen He
- How to Optimize SQL Queries? A Comparison Between Split, Holistic, and Hybrid Approaches
- Presented by: Annamaria Vass
- Buddy: Lihui Zhou
- Supervisor: Eden Chmielewski
Presentations for November 22nd
- Robust Join Processing with Diamond Hardened Joins
- Presented by: Marcelina Suszczyk
- Buddy: Elif Deniz İșbuğa
- Supervisor: Yuchen He
- SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning
- Presented by: Philipp Stoffel
- Buddy: Uros Dimitrijevic
- Supervisor: Christoph Mayer
- ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning
- Presented by: Uros Dimitrijevic
- Buddy: Nishant Kumar
- Supervisor: Christoph Mayer
- Holistic Query Approximation via RL Modeling
- Presented by: Nishant Kumar
- Buddy: Philipp Stoffel
- Supervisor: Andrei Draghici
- Query running too slow? Rewrite it with Quorion!
- Presented by: Elif Deniz İșbuğa
- Buddy: Marcelina Suszczyk
- Supervisor: Eden Chmielewski
- Streaming View: An Efficient Data Processing Engine for Modern Real-time Data Warehouse of Alibaba Cloud
- Presented by: Brighton Thomas
- Buddy: Akos Istvan Imets
- Supervisor: Haozhe Zhang
- Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables
- Presented by: Akos Istvan Imets
- Buddy: Lin Han
- Supervisor: Dan Olteanu
- Automated Generation of Materialized Views in Oracle
- Presented by: Lin Han
- Buddy: Brighton Thomas
- Supervisor: Haozhe Zhang
The following papers are left here to provide a broader context
Topic 1: Benchmarks for Query Optimization
- SQLStorm: Taking Database Benchmarking into the LLM Era
- How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks
- The Accuracy of Cardinality Estimators: Unraveling the Evaluation Result Conundrum
Further Reading:
- The UDFBench Benchmark for General-purpose UDF Queries
- An Elephant Under the Microscope: Analyzing the Interaction of Optimizer Components in PostgreSQL
- Athena: An Effective Learning-based Framework for Query Optimizer Performance Improvement
- An Adaptive Benchmark for Modeling User Exploration of Large Datasets
Topic 2: Cardinality Estimation
For everyone to read: Pessimistic Cardinality Estimation
- SafeBound: A Practical System for Generating Cardinality Bounds
- Cardinality Estimation for Having-Clauses
- COLOR: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation
- Path-centric Cardinality Estimation for Subgraph Matching
- Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server
Further Reading:
- Extensible Query Optimizers in Practice
- Table Overlap Estimation through Graph Embeddings
- SPACE: Cardinality Estimation for Path Queries Using Cardinality-Aware Sequence-based Learning
- Cardinality Estimation of LIKE Predicate Queries using Deep Learning
- Data-Agnostic Cardinality Learning from Imperfect Workloads
Topic 3: Query Optimization
- DPconv: Super-Polynomially Faster Join Ordering
- How to Optimize SQL Queries? A Comparison Between Split, Holistic, and Hybrid Approaches
- PAR2QO: Parametric Penalty-Aware Robust Query Optimization
- Galley: Modern Query Optimization for Sparse Tensor Programs
- Selective Late Materialization in Modern Analytical Databases
Further Reading:
- Hydro: Adaptive Query Processing of ML Queries
- AJOSC: Adaptive join order selection for continuous queries on data streams
- Schema-Based Query Optimisation for Graph Databases
- Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality
- Learned Offline Query Planning via Bayesian Optimization
Topic 4: Factorized Query Processing
- FDB: A Query Engine for Factorised Relational Databases
- Graphflow: An Active Graph Database (Paper, Blog Post)
- The ubiquity of large graphs and surprising challenges of graph processing: extended survey
- Robust Join Processing with Diamond Hardened Joins
- Adaptive factorization using linear-chained hash tables
Further Reading:
Topic 5: Query Processing using Reinforcement Learning
Topic 6: Robust Query Processing
- Looking ahead makes query plans robust: making the initial case with in-memory star schema data warehouse workloads
- Debunking the Myth of Join Ordering: Toward Robust SQL Analytics
- Query running too slow? Rewrite it with Quorion!
- Parachute: Single-Pass Bi-Directional Information Passing
- Yannakakis+: Practical Acyclic Query Evaluation with Theoretical Guarantees
Topic 7: Incremental View Maintenance
For everyone to read: Recent Increments in Incremental View Maintenance
- Streaming View: An Efficient Data Processing Engine for Modern Real-time Data Warehouse of Alibaba Cloud
- Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables
- DBSP: Automatic Incremental View Maintenance for Rich Query Languages, video
- Automated Generation of Materialized Views in Oracle