My main research interests are:
- Data Provenance
- Data Mining
Other database research areas I'm interested in are:
- Temporal Databases
- Database Implementation
- Index Structures
- Database Modeling
- Perm - Provenance computation for relational queries
- sesamDB: database design and implementation for a longitudinal study
2010
2009
-
Boris Glavic, Gustavo Alonso, Perm: Processing provenance and data on the same data model through query rewriting, ICDE '09: Proceedings of the 25th International Conference on Data Engineering 2009. (inproceedings)
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, inducing only a small overhead on normal operations.
-
Boris Glavic, Gustavo Alonso, Provenance for Nested Subqueries, EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology 2009. (inproceedings)
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
-
Boris Glavic, Gustavo Alonso, The Perm Provenance Management System in Action, SIGMOD '09: International Conference on Management of Data (demonstration) 2009. (inproceedings)
In this demonstration we present the Perm provenance management system (PMS). Perm
is capable of computing, storing and querying provenance information for the relational data model.
Provenance is computed by using query rewriting techniques to annotate tuples with provenance information.
Thus, provenance data and provenance computations are represented as relational data and queries and, hence,
can be queried, stored and optimized using standard relational database techniques. This demo shows the complete
Perm system and lets attendees examine in detail the process of query rewriting and provenance retrieval
in Perm, the most complete data provenance system available today. For example, Perm supports lazy
and eager provenance computation, external provenance
and various contribution semantics.
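The query-rewriting idea described in the Perm abstracts above can be illustrated with a small sketch. It is a hand-written rewrite of a single aggregation query, run on SQLite through Python's `sqlite3` module; it is not Perm's actual rewrite rules or output format. The `sales` table, its contents, and the `prov_sales_*` column names are assumptions made up for the illustration.

```python
import sqlite3

# Hypothetical example data; the table and its contents are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (shop TEXT, item TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('A', 'pen', 3), ('A', 'ink', 5), ('B', 'pen', 7);
""")

# The original query: one result tuple per shop.
ORIGINAL = "SELECT shop, SUM(amount) AS total FROM sales GROUP BY shop"

# Hand-written rewrite (an illustration, not generated by Perm): every result
# tuple is joined back to the source tuples of its group, so each output row
# carries the original result attributes plus the contributing source tuple.
REWRITTEN = """
    SELECT q.shop   AS shop,
           q.total  AS total,
           s.shop   AS prov_sales_shop,
           s.item   AS prov_sales_item,
           s.amount AS prov_sales_amount
    FROM (SELECT shop, SUM(amount) AS total FROM sales GROUP BY shop) AS q
    JOIN sales AS s ON s.shop = q.shop
    ORDER BY q.shop, s.item
"""

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

original_rows = run(ORIGINAL + " ORDER BY shop")
provenance_rows = run(REWRITTEN)

# The provenance result is an ordinary relation, so plain SQL can query it:
# which result tuples depend on a 'pen' sale?
pen_dependent = run(
    f"SELECT shop, total FROM ({REWRITTEN}) "
    "WHERE prov_sales_item = 'pen' ORDER BY shop"
)

for row in provenance_rows:
    print(row)
```

Because the rewritten query returns plain tuples, its result can be stored in a table, filtered, or joined like any other relation, which is the property the abstracts emphasize.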
2008
2007
-
Boris Glavic, Klaus R. Dittrich, Data Provenance: A Categorization of Existing Approaches, BTW '07: 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web, March 2007, Verlagshaus Mainz, Aachen. (inproceedings)
In many application areas, such as e-science and data warehousing, detailed
information about the origin of data is required. This kind of information is
often referred to as data provenance or data lineage. The provenance of a data
item includes information about the processes and source data items that led
to its creation and current representation. The diversity of data
representation models and application domains has led to a number of more or
less formal definitions of provenance. Most of them are limited to a special
application domain, data representation model or data processing facility. Not
surprisingly, the associated implementations are also restricted to some
application domain and depend on a special data model. In this paper we give a
survey of data provenance models and prototypes, present a general
categorization scheme for provenance models and use this categorization scheme
to study the properties of the existing approaches. This categorization enables
us to distinguish between different kinds of provenance information and could
lead to a better understanding of provenance in general. Besides the
categorization of provenance types, it is important to include the storage,
transformation and query requirements for the different kinds of provenance
information and application domains in our considerations. The analysis of
existing approaches will assist us in revealing open research problems in the
area of data provenance.
2006
-
Boris Glavic, Klaus R. Dittrich, sesam study team, sesam: Ensuring Privacy for a Interdisciplinary Longitudinal Study, Workshop Elektronische Datentreuhänderschaft - Anwendungen 2006, IT Verlag. (inproceedings)
Most medical, biological and social studies face the problem of storing
information about subjects for research purposes without violating the
subject's privacy. In most cases it is not possible to remove all information
that could be linked to a subject, because some of this information is needed
for the research itself. This fact holds especially for longitudinal studies,
which collect data about a subject at different times and places. Longitudinal
studies need to link different data about a specific subject, collected at
different times for research and administration use. In this paper we present
the security concept proposed for sesam, a longitudinal interdisciplinary
study that analyses the social, biological and psychological risk factors for
the development of psychological diseases. Our security concept is based on
pseudonymisation, encrypted data transfer and electronic data custodianship.
This paper is mainly a case study, and some of the security problems that emerged in
the context of sesam may not occur in other studies. Nevertheless, we
believe that an adapted version of our approach could be used in other
application scenarios as well.
-
Ira Assent, Ralph Krieger, Boris Glavic, Thomas Seidl, Spatial Multidimensional Sequence Clustering, SSTDM '06: Proc. 1st International Workshop on Spatial and Spatio-temporal Data Mining In conjunction with ICDM 2006. (inproceedings)
Measurements at different time points and positions in large temporal or spatial databases require effective and efficient data mining techniques. For several parallel measurements, finding clusters of arbitrary length and number of attributes poses additional challenges. We present a novel algorithm capable of finding parallel clusters in different structural quality parameter values for river sequences used by hydrologists to develop measures for river quality improvements.
2005
-
Boris Glavic, Subspace Sequence Clustering - Dataming zur Entscheidungsunterstützung in der Hydrologie., BTW '05: 10. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (Studierenden-Programm) 2005. (inproceedings)
Diploma, master's and bachelor's theses are available within the Perm project. If you are interested, simply contact me by email, by phone, or in my office.
| Email: | Boris Glavic |
| Address: | University of Zurich, Department of Informatics, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland |
| Office: | 2.E.12 |
| Tel.: | +41-44-635-4329 |
| Fax: | +41-44-635-6809 |

| Email: | Boris Glavic |
| Address: | sesam - Swiss ethiological study of adjustment and mental health, University of Basel, Birmannsgasse 8, CH-4009 Basel, Switzerland |
| Office: | 119 |
| Tel.: | +41-61-267-0281 |
| Fax: | |