Navigation auf uzh.ch

This course gives a unifying overview of the latest research in efficient computation over structured data, with applications spanning databases, artificial intelligence and machine learning, algorithms, and linear algebra. Besides their theoretical interest, algorithms overviewed in this course represent a key differentiator for commercial database and relational AI engines and thus essential knowledge for the future data system engineer.

This is an advanced course with lectures, exercises, and a practical task. It targets Computer Science students at MSc and PhD levels.

As part of the course, the students will learn how to formalise problems using various algebraic formalisms, analyse their computational complexity using a toolbox of techniques that exploit the algebraic and combinatorial structure of the problem, familiarise themselves with simple yet powerful algorithms that exploit such structure, and finally implement and benchmark such algorithms.

- Introduction [slides]
- Why Take This Course?
- Sample of Considered Problems:
- DB: Query evaluation
- CSP: Constraint satisfaction problems
- SAT, count-SAT: Satisfiability problems
- PGM: Inference and learning in probabilistic graphical models
- LA: Matrix chain multiplication, discrete Fourier transform in linear algebra
- ML: Cost functions and gradients for learning over structured data, einsum notation

- Uniform Approach to Computational Problems
- Course Organisation

- Commutative Semirings [slides]
- Definition, Benefits: Generalisation and Reduced Complexity
- Examples: Boolean, Sum-Product, Max-Product, Polynomial, Inner-Product
- Sample of Problems (and their Semirings)
- Algebraic Path Problems via Semiring Fixpoint Equation: Shortest Distance (Min-Sum), Connectivity and Largest Capacity (Max-Min), Maximum Reliability (Max-Product), Language Accepted by Automaton (Union-Concatenate)
- Satisfiability: Map Colouring (Boolean)
- Database Query Evaluation: Factorised Joins (Union-Product), Factorised Aggregate-Joins (Sum-Product)
- Medical Diagnosis with Bayesian Networks: Marginal Distribution (Sum-Product), MAP (Max-Product )

- FAQs: Functional Aggregate Queries [slides]
- Hypergraphs
- Definition and Examples
- Expressing Problems in FAQ
- Algebraic Path Problems
- (Count)Satisfiability
- k -Colorability
- Conjunctive Queries, Natural Join Queries
- Matrix Chain Multiplication
- Marginal Distribution and MAP in in Probabilistic Graphical Models

- Map of Expressing Problems in FAQ

- Decompositions [slides]
- FAQ compute time depends on the structure of its hypergraph
- Hypertree decompositions
- Acyclic hypergraphs: alpha-acyclic and beta-acyclic, the GYO algorithm, free-connex property
- Join trees: Constructing join trees for alpha-acyclic hypergraphs
- Examples using acyclic, triangle, bowtie, grid, 4/6-cycle, clique, Loomis-Whitney hypergraphs

- Width Measures [slides]
- Challenge of choosing a good hypertree decomposition
- Treewidth
- Edge covers and independent sets: integer programs, relaxations as linear programs, fractional edge cover number, refinement under cardinality constraints
- Hypertree width and fractional hypertree width
- Widths in the presence of free variables
- Examples using acyclic, triangle, bowtie, grid, 4/6-cycle, clique, Loomis-Whitney hypergraphs

- Solving Joins Optimally [slides]
- Instance output size versus worst-case output size
- Yannakakis algorithm for alpha-acyclic conjunctive queries
- Worst-case optimal join algorithms: LeapFrog TrieJoin Join
- Suboptimality of state-of-the-art join algorithms for cyclic queries
- Efficient processing for queries with large output
- Trade-off between preprocessing time and enumeration delay
- Query classes: conjunctive, alpha acyclic, free-connex alpha acyclic

- Worst-Case Optimal Size Bounds for Joins [slides]
- Upper Bound
- Warm-up: Triangle join
- Information theory: Random variables, (joint, conditional) entropy, Shannon inequalities, chain rule, Shearer's lemma
- Connection of Shearer's lemma with fractional edge cover number

- Lower Bound
- Warm-up: Triangle join
- Dual linear program for fractional edge cover number
- Construction of factors to meet the desired lower bound

- Same input size vs different input sizes

- Upper Bound
- Solving SAT [slides]
- SAT encoded as FAQ
- Clause representation
- The DP procedure: single-phase variables, tautological clauses, unit-clause contradictions, resolution, equi-satisfiability
- The DPLL procedure: single-phase variables, unit propagation, literal marginalisation
- DP and DPLL seen through glasses of FAQ solver
- Alpha acyclic SAT is hard
- Beta acyclic SAT is tractable
- Alternative characterisation of beta acyclicity via nested inclusion chains
- Variable elimination orders

- Solving Functional Aggregate Queries [slides]
- FAQ over one semiring
- Rewriting FAQs into alpha-acyclic FAQs
- Variable marginalisation orders
- Indicator projections
- Moving marginalisation past products

- FAQ over multiple semirings
- Product aggregates, powering factors
- Marginalisation orders for variables over different semirings

- The InsideOut algorithm
- The FAQ width

- FAQ over one semiring
- Solving Queries under Updates [slides]
- Dynamic query evaluation trade-offs: preprocessing time, enumeration delay, update time
- Delta queries
- Delta view trees
- Landcape of dynamic evaluation
- Static versus dynamic width
- Hierarchical queries
- Trade-offs between update time and enumeration delay
- Optimality of updates for hierarchical queries with dynamic width 0 and 1

The practical component of the course consists of the following three implemnentation tasks:

- Implement the worst-cast optimal join algorithm Leapfrog Triejoin (LFTJ)
- Benchmark its performance on public and private joins and databases
- Extend LFTJ with semiring computation to support aggregates over joins, e.g., count queries, queries with projections, and covariance matrix

To guide the implementation effort, a leaderboard will be maintained with the runtime performance scores of your query evaluation engines on private databases. The code can be submitted regularly and it gets compiled and tested automatically. The feedback will be provided on correctness and runtime performance. If the correctness test is passed, then then runtime performance is added to the leaderboard. Leaderboard will guide the award of bonus points to students whose code performs above average.

- Background in Computer Science at BSc level, in particular: Foundations of Computing I + II, Informatik II
- Familiarity with a programming language, e.g., C/C++, Java, Scala, Julia, or Python, needed for the practical task