-
Andrej Taliun, Michael Böhlen, Arturas Mazeika, CORE: Nonparametric Clustering of Large Numeric Databases, SDM 2009: Proceedings of the SIAM International Conference on Data Mining 2009. (inproceedings)
Current clustering techniques are able to identify arbitrarily shaped clusters in the presence of noise, but depend on carefully chosen model parameters. The choice of model parameters is difficult: it depends on the data and the clustering technique at hand, and finding good model parameters often requires time consuming human interaction. In this paper we propose CORE, a new nonparametric clustering technique that explicitly computes the local maxima of the density and represents them with cores. CORE proposes an adaptive grid and gradients to define and compute the cores of clusters. The incrementally constructed adaptive grid and the gradients make the identification of cores robust, scalable, and independent of small density fluctuations. Our experimental studies show that CORE without any carefully chosen model parameters produces better quality clustering than related techniques and is efficient for large datasets.
-
Michael Böhlen, Christian Jensen, Richard Snodgrass, Current Semantics, Encyclopedia of Database Systems: pages 544-545; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Christoph Sturm, Ela Hunt, Marc H. Scholl, Distributed Privilege Enforcement in PACS, Data and Applications Security XXIII, 23rd Annual IFIP WG 11.3 Working Conference, Montreal, Canada, July 12-15, 2009. Proceedings 2009, Springer. (inproceedings)
We present a new access control mechanism for P2P networks
with distributed enforcement, called P2P Access Control System
(PACS). PACS enforces powerful access control models like RBAC with
administrative delegation inside a P2P network in a pure P2P manner,
which is not possible in any of the currently used P2P access control
mechanisms. PACS uses client-side enforcement to support the replication
of confidential data. To avoid a single point of failure at the time of
privilege enforcement, we use threshold cryptography to distribute the
enforcement among the participants. Our analysis of the expected number
of messages and the computational effort needed in PACS shows that
its increased flexibility comes with an acceptable additional overhead.
-
Olivier Wirz, Entwurf und Implementierung eines SQL-DDL-Präprozessors zur Unterstützung von Datenbankentwurfsmustern, March 2009. (diplomathesis)
Design patterns have become essential in software development. They identify
good design and define a vocabulary among developers. Recurring patterns also
exist in database schema design but are not as well documented as design
patterns in other areas of software development, such as the design patterns of
object-oriented development. This diploma thesis provides examples of design
patterns for schema creation for relational databases. We refer to this type of
pattern as Database Design Patterns. We present SQLPP, a SQL preprocessor
for macro processing. SQLPP enables users to store and reuse design patterns
for schema creation, with the design patterns defined as macros.
-
Romans Kasperovics, Michael Böhlen, Johann Gamper, Evaluating Exceptions on Time Slices, ER 2009: 28th International Conference on Conceptual Modeling 2009. (inproceedings)
Public transport schedules contain temporal data with many regular patterns that can be represented compactly. Exceptions come as modifications of the initial schedule and break the regular patterns, increasing the size of the representation. A typical strategy to preserve the compactness of schedules is to keep exceptions separately. This, however, complicates the automated processing of schedules and imposes a more complex model on applications. In this paper we evaluate exceptions by incorporating them into the patterns that define schedules. We employ sets of time slices, termed multislices, as a representation formalism for schedules and exceptions. The difference of multislices corresponds to the evaluation of exceptions and produces an updated schedule in terms of a multislice. We propose a relational model for multislices, provide an algorithm for efficiently evaluating the difference of multislices, and show analytically and experimentally that the evaluation of exceptions is a feasible strategy for realistic schedules.
-
Michael Böhlen, Christian Jensen, Richard Snodgrass, Nonsequenced Semantics, Encyclopedia of Database Systems: pages 1913-1915; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Juozas Gordevicius, Johann Gamper, Michael Böhlen, Parsimonious temporal aggregation, EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology 2009. (inproceedings)
Temporal aggregation is a crucial operator in temporal databases and has been studied in various flavors, including instant temporal aggregation (ITA) and span temporal aggregation (STA), each having its strengths and weaknesses. In this paper we define a new temporal aggregation operator, called parsimonious temporal aggregation (PTA), which comprises two main steps: (i) it computes the ITA result over the input relation and (ii) it compresses this intermediate result to a user-specified size c by merging adjacent tuples and keeping the induced total error minimal; the compressed ITA result is returned as the final result. By considering the distribution of the input data and allowing the result size to be controlled, PTA combines the best features of ITA and STA. We provide two evaluation algorithms for PTA queries. First, the oPTA algorithm computes an exact solution by applying dynamic programming to explore all possibilities to compress the ITA result and selecting the compression with the minimal total error. It runs in O(n²pc) time and O(n²) space, where n is the size of the input relation and p is the number of aggregation functions in the query. Second, the more efficient gPTA algorithm computes an approximate solution by greedily merging the most similar ITA result tuples, which, however, does not guarantee a compression with a minimal total error. gPTA intermingles the two steps of PTA and avoids large intermediate results. The compression step of gPTA runs in O(np log(c + δ)) time and O(c + δ) space, where δ is a small buffer for "look ahead". An empirical evaluation shows good results: considerable reductions of the result size introduce only small errors, and gPTA scales to large data sets and is only slightly worse than the exact solution of PTA.
-
Boris Glavic, Gustavo Alonso, Perm: Processing provenance and data on the same data model through query rewriting, ICDE '09: Proceedings of the 25th International Conference on Data Engineering 2009. (inproceedings)
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of a relational database, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information inducing only a small overhead on normal operations.
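The rewriting idea in this abstract can be illustrated with a small sketch. This is not Perm's actual rewrite procedure or output format: the tables, the query, and the `prov_` column naming are all hypothetical, and SQLite stands in for PostgreSQL purely for convenience. It only shows how a rewritten select-project-join query can return each result tuple together with the source tuples that contributed to it, as ordinary relational data.

```python
import sqlite3

# Hypothetical example schema and data (not taken from the paper).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop(name TEXT, city TEXT);
    CREATE TABLE sale(shop TEXT, amount INTEGER);
    INSERT INTO shop VALUES ('A', 'Zurich'), ('B', 'Basel');
    INSERT INTO sale VALUES ('A', 10), ('A', 25), ('B', 7);
""")

# Original query: cities of shops with a sale above 8.
original = """
    SELECT s.city
    FROM shop s JOIN sale x ON x.shop = s.name
    WHERE x.amount > 8
    ORDER BY x.amount
"""

# Hand-written "rewritten" query: same result column, extended with the
# attributes of the contributing source tuples (assumed prov_ prefix).
rewritten = """
    SELECT s.city,
           s.name AS prov_shop_name, s.city AS prov_shop_city,
           x.shop AS prov_sale_shop, x.amount AS prov_sale_amount
    FROM shop s JOIN sale x ON x.shop = s.name
    WHERE x.amount > 8
    ORDER BY x.amount
"""

plain_rows = conn.execute(original).fetchall()
prov_rows = conn.execute(rewritten).fetchall()
for row in prov_rows:
    print(row)
```

Because the provenance columns are part of a normal query result, they can be filtered, joined, or stored with standard SQL, which is the property the abstract emphasizes.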
-
Boris Glavic, Gustavo Alonso, Provenance for Nested Subqueries, EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology 2009. (inproceedings)
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
-
Michael Böhlen, Christian Jensen, Sequenced Semantics, Encyclopedia of Database Systems: pages 2619-2621; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Igor Timko, Michael Böhlen, Johann Gamper, Sequenced spatio-temporal aggregation in road networks, EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology 2009. (inproceedings)
Many applications of spatio-temporal databases require support for sequenced spatio-temporal (SST) aggregation, e.g., when analyzing traffic density in a city. Conceptually, an SST aggregation produces one aggregate value for each point in time and space.
This paper is the first to propose a method to efficiently evaluate SST aggregation queries for the COUNT, SUM, and AVG aggregation functions. Based on a discrete time model and a discrete, 1.5-dimensional space model that represents a road network, we generalize the concept of (temporal) constant intervals towards constant rectangles that represent maximal rectangles in the space-time domain over which the aggregation result is constant. We propose a new data structure, termed SST-tree, which extends the Balanced Tree for one-dimensional temporal aggregation towards the support for two-dimensional, spatio-temporal aggregation. The main feature of the Balanced Tree, storing constant intervals in a compact way by using two counters, is extended towards a compact representation of constant rectangles in the space-time domain. We propose and evaluate two variants of the SST-tree. The SSTT-tree and SSTH-tree use trees and hashmaps to manage spacestamps, respectively. Our experiments show that both solutions outperform a brute force approach in terms of memory and time. The SSTH-tree is more efficient in terms of memory, whereas the SSTT-tree is more efficient in terms of time.
-
Michael Böhlen, Johann Gamper, Christian Jensen, Richard Snodgrass, SQL-Based Temporal Query Languages, Encyclopedia of Database Systems: pages 2762-2768; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Johann Gamper, Michael Böhlen, Christian Jensen, Temporal Aggregation, Encyclopedia of Database Systems: pages 2924-2929; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Michael Böhlen, Temporal Coalescing, Encyclopedia of Database Systems: pages 2932-2936; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Michael Böhlen, Christian Jensen, Richard Snodgrass, Temporal Compatibility, Encyclopedia of Database Systems: pages 2936-2939; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Michael Böhlen, Temporal Query Processing, Encyclopedia of Database Systems: pages 3012-3015; ISBN 978-0-387-35544-3; 2009. (incollection)
-
Boris Glavic, Gustavo Alonso, The Perm Provenance Management System in Action, SIGMOD '09: International Conference on Management of Data (demonstration) 2009. (inproceedings)
In this demonstration we present the Perm provenance management system (PMS). Perm
is capable of computing, storing and querying provenance information for the relational data model.
Provenance is computed by using query rewriting techniques to annotate tuples with provenance information.
Thus, provenance data and provenance computations are represented as relational data and queries and, hence,
can be queried, stored and optimized using standard relational database techniques. This demo shows the complete
Perm system and lets attendees examine in detail the process of query rewriting and provenance retrieval
in Perm, the most complete data provenance system available today. For example, Perm supports lazy
and eager provenance computation, external provenance
and various contribution semantics.
-
Samuel Mezger, Untersuchung der Skalierbarkeit verschiedener Datenbankmanagementsysteme unter hohen Nutzerzahlen, August 2009. (misc/Facharbeit)
The work presented here describes measurements of transaction throughput for different database
management systems that focus on concurrency control. The measurements were taken for IBM
DB2 9.5, PostgreSQL 8.3 and Microsoft SQL Server 2008. During the measurements, the following
parameters were varied to determine their effect on throughput: the isolation level,
the amount of memory available for the database's buffer pool, the database's cardinality and the
number of operations per transaction. When relating the measurement results to the expectations
based on theoretical principles, it is found that while some effects show as expected, many
phenomena have to be attributed to specific implementations of the different database management
systems. Expected results like lock thrashing and throughput limitations due to I/O performance
or the CPU's processing speed are apparent. Unexpectedly, I/O performance is a limiting factor
not only when small database buffer pools are used, and lock thrashing affects each database
management system in a different way. Furthermore, it is found that all of the management
systems used can reach higher throughput numbers at higher isolation levels. DB2 is noted to break
connections when the database's buffer pool is chosen too large, while SQL Server does the same when
the database's buffer pool is chosen too small. For PostgreSQL, transaction throughput is reduced
when the level of concurrency is increased. This happens due to the multi-versioning protocol used
by PostgreSQL, which leads to an increase in memory consumption under these conditions.
-
Patrick Ziegler, Klaus R. Dittrich, Ela Hunt, A Call for Personal Semantic Data Integration, Workshop on Information Integration Methods, Architectures, and Systems (IIMAS 2008) (in conjunction with ICDE 2008) 2008. (inproceedings)
-
Juozas Gordevicius, Johann Gamper, Michael Böhlen, A Greedy Approach Towards Parsimonious Temporal Aggregation, TIME 2008: 15th International Symposium on 16-18 June 2008. (inproceedings)
Temporal aggregation is a crucial operator in temporal databases and has been studied in various flavors. In instant temporal aggregation (ITA) the aggregate value at time instant t is computed from the tuples that hold at t. ITA considers the distribution of the input data and works at the smallest time granularity, but the result size depends on the input timestamps and can get twice as large as the input relation. In span temporal aggregation (STA) the user specifies the timestamps over which the aggregates are computed and thus controls the result size. In this paper we introduce a new temporal aggregation operator, called greedy parsimonious temporal aggregation (PTAg), which combines features from ITA and STA. The operator extends and approximates ITA by greedily merging adjacent tuples with similar aggregate values until the number of result tuples is sufficiently small, which can be controlled by the application. Thus, PTAg considers the distribution of the data and allows the result size to be controlled. Our empirical evaluation on real world data shows good results: considerable reductions of the result size introduce only small errors.
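The two steps named in this abstract, computing an ITA result and then greedily merging similar adjacent tuples, can be sketched as follows. This is an illustrative sketch only, not the authors' PTAg algorithm: the function names are hypothetical, intervals are assumed to be half-open [start, end), SUM is the only aggregate, "similar" is assumed to mean the smallest absolute difference of aggregate values, and a merged tuple is assumed to take the duration-weighted mean of the two values. Equal adjacent ITA values are not coalesced here.

```python
def ita_sum(tuples):
    """ITA with SUM: one result tuple per maximal interval between
    consecutive input timestamps that is covered by at least one tuple.
    Input tuples are (value, start, end) with half-open [start, end)."""
    points = sorted({t for _, s, e in tuples for t in (s, e)})
    result = []
    for lo, hi in zip(points, points[1:]):
        # values of all input tuples that hold throughout [lo, hi)
        vals = [v for v, s, e in tuples if s <= lo and hi <= e]
        if vals:
            result.append((lo, hi, sum(vals)))
    return result


def greedy_merge(result, c):
    """Merge the most similar pair of adjacent tuples until at most
    c tuples remain (or no adjacent pair is left)."""
    rows = list(result)
    while len(rows) > c:
        # only tuples that meet in time (no gap) may be merged
        candidates = [i for i in range(len(rows) - 1)
                      if rows[i][1] == rows[i + 1][0]]
        if not candidates:
            break
        i = min(candidates, key=lambda k: abs(rows[k][2] - rows[k + 1][2]))
        (s1, e1, v1), (s2, e2, v2) = rows[i], rows[i + 1]
        w1, w2 = e1 - s1, e2 - s2
        # assumed merge rule: duration-weighted mean of the two values
        rows[i:i + 2] = [(s1, e2, (v1 * w1 + v2 * w2) / (w1 + w2))]
    return rows


data = [(10, 0, 4), (20, 2, 6), (5, 6, 8)]   # (value, start, end)
ita = ita_sum(data)
compressed = greedy_merge(ita, 2)
print(ita)
print(compressed)
```

With three input tuples the ITA result already has four tuples, which illustrates why the result size depends on the input timestamps; the size bound `c` then gives the application control over it.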
-
Christoph Sturm, Klaus R. Dittrich, Patrick Ziegler, An Access Control Mechanism for P2P Collaborations, DaMaP '08: Proceedings of the 2008 international workshop on Data management in peer-to-peer systems 2008, ACM. (inproceedings)
-
Arturas Mazeika, Michael Böhlen, Daniel Trivellato, Analysis and Interpretation of Visual Hierarchical Heavy Hitters of Binary Relations, ADBIS 2008: Analysis and Interpretation of Visual HHHs of Binary Relations; Lecture Notes in Computer Science Volume 5207/2008 page 168-183; ISBN 978-3-540-85712-9 2008. (inproceedings)
The emerging field of visual analytics changes the way we model, gather, and analyze data. Current data analysis approaches suggest gathering as much data as possible and then focusing on goal and process oriented data analysis techniques. Visual analytics changes this approach and the methodology to interpret the results becomes the key issue. This paper contributes a method to interpret visual hierarchical heavy hitters (VHHHs). We show how to analyze data on the general level and how to examine specific areas of the data. We identify five common patterns that build the interpretation alphabet of VHHHs. We demonstrate our method on three different real world datasets and show the effectiveness of our approach.
-
Nikolaus Augsten, Michael Böhlen, Curtis E. Dyreson, Johann Gamper, Approximate Joins for Data-Centric XML, ICDE 2008: 24th International Conference on 7-12 April 2008. (inproceedings)
In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly differently in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies.
In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.
-
Ionut Subasu, Patrick Ziegler, Klaus R. Dittrich, Harald C. Gall, Architectural Concerns for Flexible Data Management, EDBT 2008 Workshops, March 2008, ACM. (inproceedings/Workshop paper)
Evolving database management systems (DBMS) towards
more flexibility in functionality, adaptation to changing
requirements, and extensions with new or different
components, is a challenging task. Although many approaches
have tried to come up with a flexible architecture, there
is no architectural framework that is generally applicable to
provide tailor-made data management and can directly
integrate existing application functionality. We discuss an
alternative database architecture that enables more lightweight
systems by decomposing the functionality into services and
having the service granularity drive the functionality. We
propose a service-oriented DBMS architecture which provides
the necessary flexibility and extensibility for general-purpose
usage scenarios. For that we present a generic storage
service system to illustrate our approach.
-
Simeon Simoff, Michael Böhlen, Arturas Mazeika, Assisting Human Cognition in Visual Data Mining, Visual Data Mining: Assisting Human Cognition in Visual Data Mining; Lecture Notes in Computer Science Volume 4404/2008 page 264-280; ISBN 978-3-540-71079-0 2008. (incollection)
As discussed in Part 1 of the book, in the chapter "Form-Semantics-Function. A Framework for Designing Visualisation Models for Visual Data Mining", the development of consistent visualisation techniques requires a systematic approach related to the tasks of the visual data mining process. The chapter "Visual discovery of network patterns of interaction between attributes" presents a methodology based on viewing visual data mining as a reflection-in-action process. This chapter follows the same perspective and focuses on the subjective bias that may appear in visual data mining. The work is motivated by the fact that visual, though very attractive, also means subjective, and non-experts are often left to utilise visualisation methods (as an understandable alternative to the highly complex statistical approaches) without the ability to understand their applicability and limitations. The chapter presents two strategies addressing the subjective bias: guided cognition and validated cognition, which result in two types of visual data mining techniques: interaction with visual data representations, mediated by statistical techniques, and validation of the hypotheses coming as an output of the visual analysis through another analytics method, respectively.
-
Ira Assent, Ralph Krieger, Boris Glavic, Thomas Seidl, Clustering Multidimensional Sequences in Spatial and Temporal Databases, International Journal on Knowledge and Information Systems (KAIS) Vol. 16, Issue 1 2008. (article)
Many environmental, scientific, technical or medical database applications require effective and efficient mining of time series, sequences or trajectories of measurements taken at different time points and positions forming large temporal or spatial databases. In particular, the analysis of concurrent and multidimensional sequences poses new challenges in finding clusters of arbitrary length and varying number of attributes. We present a novel algorithm capable of finding parallel clusters in different subspaces and demonstrate our results for temporal and spatial applications. Our analysis of structural quality parameters in rivers is successfully used by hydrologists to develop measures for river quality improvements.
-
Sascha Nedkoff, DBDoc Entwurf und Implementierung einer Anwendung zur partiellen Automation des Dokumentations-Prozesses für Datenbanken, January 2008. (diplomathesis)
Nowadays most relational database systems support the specification and storage of user-defined comments for database objects. Those comments can be considered a rudimentary documentation of the database schema. On their own, however, they are insufficient and inconvenient for documenting a database, because they can only be accessed in a cumbersome way and much other schema information is also relevant for a documentation. Within this thesis an application for a partial automation of the documentation process is developed and implemented, which is capable of generating a database documentation by accessing the user-defined comments and schema information. It should generally support various output formats and various database systems as well as database design patterns.
-
Stephan Blatti, Entwurf und Implementierung eines Provenance Browsers für die Visualisierung von Data Provenance, February 2008. (diplomathesis)
-
Svetlana Gerster, Entwurf und Umsetzung eines Prototyps zur effizienten Speicherung von Hoch-Volumen Prozessdaten, September 2008. (diplomathesis)
This diploma thesis deals with the handling and storage of process data inside a company. Information
technology provides broad support both for single process steps and for complete processes. Many
companies invest effort in modeling and automating their business processes. Workflow management
systems play a major role in process support. Workflow systems not only coordinate the process flow
but also collect information about process execution. Such information is important for process monitoring
and improvement. The enormous growth of process data raises the question of how to store it efficiently. This work
addresses this question and attempts to find a solution for efficient data storage. The central question is
how the application handles process data and how these data are stored in a relational database.
The goal of this thesis is to analyze the current solution and to develop proposals for possible improvements.
-
Lukas Knauer, Konzeption einer Abfragesprache für Pylonix, May 2008. (diplomathesis)
Although documents are an important part of everyday business, there is no satisfying solution to manage them. Documents contain information crucial to business. To this day, it is an unsolved challenge to extract desired information from the huge number of documents produced.
In contrast to documents, other business data is highly structured and can be stored in databases. Storage in a database offers many advantages such as concurrent processing and optimized search. Thus it is desirable to use these features in document management.
The new approach Pylonix offers an architecture and a data model to store complex documents in databases. A flexible and powerful query language, TXQL (TeXt Query Language), is designed and discussed within the scope of this master's thesis. This language is able to query and process all elements, information as well as metadata of documents. It allows complex and comprehensive queries of arbitrary elements of complex documents that are stored in Pylonix. Furthermore, TXQL offers facilities to manipulate every element of such a document.
-
Christian Tilgner, Dietrich Christopeit, Pylonix Data Model, Department of Informatics, University of Zurich, May 2008. (techreport/Technical Report)
-
Christian Tilgner, Dietrich Christopeit, Klaus R. Dittrich, Patrick Ziegler, Pylonix: A Database Module for Collaborative Document Management, Twelfth East-European Conference on Advances in Databases and Information (ADBIS), September 2008. (inproceedings)
-
Romans Kasperovics, Michael Böhlen, Johann Gamper, Representing Public Transport Schedules as Repeating Trips, TIME '08. 15th International Symposium on 16-18 June 2008. (inproceedings)
The movement in public transport networks is organized according to schedules. The real-world schedules are specified by a set of periodic rules and a number of irregularities from these rules. The irregularities appear as cancelled trips or additional trips on special occasions such as public holidays, strikes, cultural events, etc. Under such conditions, it is a challenging problem to capture real-world schedules in a concise way. This paper presents a practical approach for modelling real-world public transport schedules. We propose a new data structure, called repeating trip, that combines route information and the schedule at the starting station of the route; the schedules at other stations can be inferred. We define schedules as semi-periodic temporal repetitions, and store them as pairs of rules and exceptions. Both parts are represented in a tree structure, termed multislice, which can represent finite and infinite periodic repetitions. We illustrate our approach on a real-world schedule and perform an in-depth comparison with related work.
-
Patrick Ziegler, Ela Hunt, Semantic Mashups with BioXMash, Data Integration in the Life Sciences 2008 (DILS 2008) 2008. (inproceedings)
-
Michael Böhlen, Linas Bukauskas, Arturas Mazeika, Peer Mylov, The 3DVDM Approach: A Case Study with Clickstream Data, Visual Data Mining; A Case Study with Clickstream Data; Lecture Notes in Computer Science Vol. 4404/2008 page 13-29, ISBN 978-3-540-71079-0 2008. (incollection)
Clickstreams are among the most popular data sources because Web servers automatically record each action and the Web log entries promise to add up to a comprehensive description of behaviors of users. Clickstreams, however, are large and raise a number of unique challenges with respect to visual data mining. At the technical level the huge amount of data requires scalable solutions and limits the presentation to summary and model data. Equally challenging is the interpretation of the data at the conceptual level. Many analysis tools are able to produce different types of statistical charts. However, the step from statistical charts to comprehensive information about customer behavior is still largely unresolved. We propose a density surface based analysis of 3D data that uses state-of-the-art interaction techniques to explore the data at various granularities.
-
Carl-Christian Kanne, Alexander Böhm, The Demaq System: Declarative Development of Distributed Applications, Proceedings of the 28th ACM SIGMOD/PODS International Conference on Management of Data / Principles of Database Systems, Vancouver, BC, Canada. 2008. (inproceedings)
The goal of the Demaq project is to investigate a novel way of thinking about distributed applications that are based on the asynchronous exchange of XML messages. Unlike today's solutions that rely on imperative programming languages and multi-tiered application servers, Demaq uses a declarative language for implementing the application logic as a set of rules. A rule compiler transforms the application specifications into execution plans against the message history, which are evaluated using our optimized runtime engine. This allows us to leverage existing knowledge about declarative query processing for optimizing distributed applications.
-
Michael Böhlen, Johann Gamper, Christian Jensen, Towards General Temporal Aggregation, BNCOD 2008 Proceedings of the 25th British National Conference on Database (BNCOD); Lecture Notes in Computer Science Volume 5071/2008 page 257-269; ISBN 978-3-540-70503-1 2008. (inproceedings)
Most database applications manage time-referenced, or temporal, data. Temporal data management is difficult when using conventional database technology, and many contributions have been made for how to better model, store, and query temporal data. Temporal aggregation illustrates well the problems associated with the management of temporal data. Indeed, temporal aggregation is complex and among the most difficult, and thus interesting, temporal functionality to support. This paper presents a general framework for temporal aggregation that accommodates existing kinds of aggregation, and it identifies open challenges within temporal aggregation.
-
Daniel Trivellato, Arturas Mazeika, Michael Böhlen, Using 2D Hierarchical Heavy Hitters to Investigate Binary Relationships, Visual Data Mining; Using 2D HHHs to Investigate Binary Relationships; Lecture Notes in Computer Science Vol. 4404/2008 pages 215-235, ISBN 978-3-540-71079-0 2008. (incollection)
This chapter presents VHHH: a visual data mining tool to compute and investigate hierarchical heavy hitters (HHHs) for two-dimensional data. VHHH computes the HHHs for a two-dimensional categorical dataset and a given threshold, and visualizes the HHHs in the three dimensional space. The chapter evaluates VHHH on synthetic and real world data, provides an interpretation alphabet, and identifies common visualization patterns of HHHs.
-
Arturas Mazeika, Michael Böhlen, Peer Mylov, Using Nested Surfaces for Visual Detection of Structures in Databases, Visual Data Mining: Using Nested Surfaces for Visual Detection of Structures in Databases; Lecture Notes in Computer Science Volume 4404/2008 page 91-102; ISBN 978-3-540-71079-0 2008. (incollection)
We define, compute, and evaluate nested surfaces for the purpose of visual data mining. Nested surfaces enclose the data at various density levels, and make it possible to equalize the more and less pronounced structures in the data. This facilitates the detection of multiple structures, which is important for data mining where the less obvious relationships are often the most interesting ones. The experimental results illustrate that surfaces are fairly robust with respect to the number of observations, easy to perceive, and intuitive to interpret. We give a topology-based definition of nested surfaces and establish a relationship to the density of the data. Several algorithms are given that compute surface grids and surface contours, respectively.
-
Simeon Simoff, Michael Böhlen, Arturas Mazeika, Visual Data Mining: An Introduction and Overview, Visual Data Mining - Theory, Techniques and Tools for Visual Analytics; Lecture Notes in Computer Science Vol. 4404/2008 page 1-12, ISBN 978-3-540-71079-0 2008. (incollection)
In our everyday life we interact with various information media, which present us with facts and opinions, supported with some evidence, based, usually, on condensed information extracted from data. It is common to communicate such condensed information in a visual form - a static or animated, preferably interactive, visualisation. For example, when we watch familiar weather programs on the TV, landscapes with cloud, rain and sun icons and numbers next to them quickly allow us to build a picture about the predicted weather pattern in a region. Playing sequences of such visualisations will easily communicate the dynamics of the weather pattern, based on the large amount of data collected by many thousands of climate sensors and monitors scattered across the globe and on weather satellites. These pictures are fine when one watches the weather on Friday to plan what to do on Sunday - after all, if the patterns are wrong there are always alternative ways of enjoying a holiday. Professional decision making would be a rather different scenario. It will require weather forecasts at a high level of granularity and precision, and in real-time. Such requirements translate into requirements for high volume data collection, processing, mining, modelling and communicating the models quickly to the decision makers. Further, the requirements translate into high-performance computing with integrated efficient interactive visualisation. From a practical point of view, if a weather pattern cannot be depicted fast enough, then it has no value. Recognising the power of the human visual perception system and pattern recognition skills adds another twist to the requirements - data manipulations need to be completed at least an order of magnitude faster than real-time in order to combine them with a variety of highly interactive visualisations, allowing easy remapping of data attributes to the features of the visual metaphor used to present the data.
In these few steps in the weather domain, we have specified some requirements for a visual data mining system.
-
Michael Keller, Visualisierung von datenbankunterstützten Prozessen, February 2008. (bachelorsthesis)
This paper describes new possibilities for the visualization of workflow-runtime-data and shows their exemplary use in a practical software-project. It discusses different approaches to the graphical representation of this data. The most important insight is that it is very important for a useful visualization to visualize additional information beside the nodes and edges. This data should be positioned close to the related edge or node to make sure it can easily be matched to its related object. A visualization of all the available data will lead to an overloaded graph, which makes the possibility to hide and show data an important requirement.
-
Patrick Ziegler, Evaluation of SIRUP with the SIRUP Classification of Data Integration Conflicts, University of Zurich, July 2007. (techreport)
-
Claudio Jossen, Klaus R. Dittrich, The Process of Metadata Modelling in Industrial Data Warehouse Environments, BTW Workshops 2007, March 2007, Verlagshaus Mainz, Aachen. (inproceedings/Workshop)
-
Danar Barzanji, Visualisierung von Metadaten-Hierarchien in einer serviceorientierten Architektur, December 2007. (diplomathesis)
This diploma thesis aims at the design and implementation of a web application that visualizes a hierarchical structure for metadata in a service-oriented architecture. Two major problems of traditional web applications were identified: usability and performance. These problems reduce the capability of web applications to visualize and navigate metadata models. For this reason, new technologies for developing web applications were analyzed. The main aim of this analysis was to identify technologies that support the development of rich internet applications (RIAs). RIAs are web applications that have the features and functionality of traditional desktop applications. Two technologies were chosen to develop a RIA: Ajax technologies and JavaServer Faces.
-
Markus Innerebner, Michael Böhlen, Igor Timko, A web-enabled extension of a spatio-temporal DBMS, GIS '07: Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems 2007, ACM. (inproceedings)
Many database applications deal with spatio-temporal phenomena, and during the last decade a lot of research targeted location-based services, moving objects, traffic jam preventions, meteorology, etc. In strong contrast, there exist only very few proposals for an implementation of a spatio-temporal database system let alone a web-based spatio-temporal information system.
This paper describes the design and implementation of a web-based spatio-temporal information system. The system uses Secondo as spatio-temporal DBMS for handling moving objects and MapServer as an OGC-compliant rendering engine for static spatial data. We describe the architecture of the system and compare our system with a standalone application. The paper investigates in detail issues that arise in the context of the web. First, we describe an implementation of a lightweight client that takes advantage of the functionality offered by Secondo and MapServer. Second, we describe how moving objects can be represented in GML. We discuss possible GML representations, propose an extension of GML that uses 3D segments (2D location + time) to represent moving objects, and present experiments that compare the solutions.
-
Martin Spörri, Administration of Metadata Models with Semantic Web Technologies, March 2007. (diplomathesis)
This thesis was written between September 2006 and March 2007 as a diploma thesis at the Database Technology Group, which is part of the department of informatics at the University of Zurich. The aim on one hand was to show what Semantic Web technologies are and how they can be used to administrate metadata. On the other hand the mission was to build a standalone software application that integrates into the existing metadata management system of Helsana Versicherungen AG and provides additional flexibility and functionality to the system. The thesis is divided into three parts: After a short introduction, the first part describes terms and technologies related to the Semantic Web and metadata management. The second part covers the planning, implementation and evaluation of the software application that was built, and the third part contains an overview of the work that was done as well as an outlook to possible further development.
-
Patrick Ziegler, Klaus R. Dittrich, Data Integration - Problems, Approaches, and Perspectives, Conceptual Modelling in Information Systems Engineering, Editor(s): John Krogstie, Andreas Opdahl, Sjaak Brinkkemper; 2007, Springer. (inproceedings)
-
Boris Glavic, Klaus R. Dittrich, Data Provenance: A Categorization of Existing Approaches, BTW '07: 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web, March 2007, Verlagshaus Mainz, Aachen. (inproceedings)
In many application areas like e-science and data-warehousing detailed
information about the origin of data is required. This kind of information is
often referred to as data provenance or data lineage. The provenance of a data
item includes information about the processes and source data items that led
to its creation and current representation. The diversity of data
representation models and application domains has led to a number of more or
less formal definitions of provenance. Most of them are limited to a special
application domain, data representation model or data processing facility. Not
surprisingly, the associated implementations are also restricted to some
application domain and depend on a special data model. In this paper we give a
survey of data provenance models and prototypes, present a general
categorization scheme for provenance models and use this categorization scheme
to study the properties of the existing approaches. This categorization enables
us to distinguish between different kinds of provenance information and could
lead to a better understanding of provenance in general. Besides the
categorization of provenance types, it is important to include the storage,
transformation and query requirements for the different kinds of provenance
information and application domains in our considerations. The analysis of
existing approaches will assist us in revealing open research problems in the
area of data provenance.
-
Tobias Schlaginhaufen, Design and Implementation of a Database Client Application for Inserting, Modifying, Presentation and Export of Bitemporal Personal Data, April 2007. (diplomathesis)
There is little support for temporal data in conventional relational databases, although many applications require temporal or even bitemporal storage of data. This thesis describes the implementation of a bitemporal database and a corresponding client application for managing the study subjects of a longitudinal etiological study about adjustment and mental health. Designing a bitemporal database on top of a relational database model involves the dilemma of time-normalization. Either one uses the well-established storage organization and query evaluation techniques of relational databases and accepts a certain amount of redundant storage, or one time-normalizes the data model and avoids redundancy but, at the same time, has to accept a degenerated relational model, which is complex, difficult to handle and may degrade the performance of a relational database system. We present an approach that strikes a balance between these two extremes.
-
Philippe Hochstrasser, Entwurf und Implementierung einer Anwendung zur computergestützten Durchführung von klinischen Interviews, September 2007. (diplomathesis)
A subproject of the national main research topic Sesam aims to design and implement a database for the main project and requires an application for conducting standardized interviews. The existing software, DIAX, needs to be replaced by a new application that goes beyond the old one by implementing additional functionality. The application reads and interprets interview definitions stored as XML documents in order to conduct the interview and to support the interviewer during the conversation. The collected data can be exported in a form that the database can read easily. This report describes the problems that arose during the design and implementation of the application and how they were resolved.
-
Humard Claude, Entwurf und Implementierung einer modular erweiterbaren Anwendung für Eingabe und Import von heterogenen Daten, May 2007. (diplomathesis)
In the context of the national main research topic Sesam, an application is required to validate and store the results of research activities, including their metadata. The measured data is very heterogeneous, as it contains barcodes, questionnaires or even video files. The main challenge of this thesis is the development of an architecture that can easily be extended to support new data formats. The aim of this diploma thesis is to design and implement a modular application written in Java that covers the requirements mentioned above. Additionally, the concepts, design patterns and frameworks used are described in detail. Furthermore, an in-depth presentation of the developed architecture is included to support further enhancements of the application.
-
Sara Khaleghi, Erstellung und Bewertung eines Konzeptes für die Archivierung und die Bereinigung von Stammdaten und Kursdaten im Bankenbereich, January 2007. (diplomathesis)
The goal of this thesis was the creation and validation of an archiving concept for valor-specific static and pricing data, and the development of a data housekeeping concept for UBS AG. To achieve these goals, the current system landscape and general framework were analysed, new requirements for archiving were collected, and a concept was derived that takes the current state of archiving technology into account. In addition, optimisations of the existing archiving solution were proposed, and a framework for data housekeeping was developed, which helps to create a housekeeping plan as soon as the outstanding business requirements have been collected.
-
Arturas Mazeika, Michael Böhlen, Nick Koudas, Divesh Srivastava, Estimating the selectivity of approximate string queries, ACM Trans. Database Syst. 32 2007. (article)
Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.
We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
-
Patrick Ziegler, Evaluation of SIRUP with the THALIA Benchmark for Data Integration Systems, Department of Informatics, University of Zurich 2007. (techreport)
-
Stefan Schurgast, Export von Datenbankinhalten in Datenformate von Statistikprogrammen, December 2007. (bachelorsthesis)
The sesamDB project is a subproject of sesam, an interdisciplinary long-term study of the etiology
of mental disorders. Its main task is to develop a database for the scientific and administrative
data of sesam and to implement various client applications. In order to analyze the data collected
in sesam with statistical software, an application was developed that exports data from sesamDB to
common statistics formats. A graphical user interface allows the user to obtain the required data
without knowledge of the underlying database schema or of query languages. This thesis contains a
survey of related work and describes the development process and the architecture of the export
program, Sesam Export Manager.
-
Claudio Jossen, Metadaten Management - Grundlagen und industrielle Praxis, 2007. (book)
-
Annette Gähler, Simplifying Master Data Access, November 2007. (diplomathesis)
In today's business environment, corporations are drowning in the information they have collected over the years. Whether caused by incomplete, incorrect, inconsistent, or simply inaccessible data, the management and provisioning of accurate and consistent enterprise data is a complex and time-consuming task. Fast and easy access to up-to-date master data is a necessary precondition in today's knowledge-centric business environment. In a rapidly changing knowledge environment, the question arises how IT departments can adopt new forms of information architecture in order to fulfill business needs. The main contribution of the diploma thesis at hand is a new architectural approach that simplifies access to accurate and up-to-date master data by using the concept of ontology in order to build a corporate data language. Ontology is an enabler for a consistent and holistic view of enterprise master data that can be accessed and searched by means of a single interface. The proposed architecture has been successfully implemented and tested with an explorative prototype in the master data environment of the world's largest reinsurer, Swiss Re, but is not limited to this specific context.
-
Stéphanie Eugster, Standardisierte Datentypen (Data Items) als integrierendes Element in einer komponenten-basierten Architektur, August 2007. (diplomathesis)
UBS Wealth Management and Business Banking (WM&BB) builds its software development on its component-based origins. The interfaces of these software components should therefore also be based on standardized data items. In this thesis, the current situation is analyzed and the requirements for standardized data management are recorded and evaluated. The main problems are identified as fragmented configuration management and unsatisfactory metadata quality. The discussion shows that, with regard to front-end and communication problems, a solution can be provided through the implementation of validators and mediators. For the most important requirements, a transformation plan is drafted. The proposed solution has been validated with the help of a prototype.
-
Patrick Ziegler, The SIRUP Approach to Personal Semantic Data Integration, October 2007. (doctoralthesis)
-
Ionut Subasu, Patrick Ziegler, Klaus R. Dittrich, Towards Service-Based Data Management Systems, Datenbanksysteme in Business, Technologie und Web (BTW 2007), Workshop Proceedings, March 2007. (inproceedings/Workshop)
-
Seung Hee Ma, Transformation und Aggregation von Extraktions-, Transformations- und Lade-Metadaten aus Data-Warehouses, October 2007. (diplomathesis)
Since 2005, a company-wide metadata management system (MDMS) has been developed by the University of Zurich together with Helsana Versicherung AG. The metadata in MDMS is of different kinds and belongs to various metadata models. It is integrated into a data warehouse system by extraction, transformation and load procedures. Tracing the data flow of the metadata from the data warehouse to the source systems helps to collect the information required to connect the metadata models. A parser application was implemented to support the tracing of metadata from the data warehouse. With this program, the derivations and modifications performed during the transformations of the metadata can be identified. It facilitates the transfer of complete metadata from diverse applications to the MDMS.
-
Peter Sune Jorgensen, Michael Böhlen, Versioned Relations: Support for Conditional Schema Changes and Schema Versioning, DASFAA 2007: Support for Conditional Schema Changes and Schema Versioning; Lecture Notes in Computer Science Volume 4443/2008 page 1058-1061, ISBN 978-3-540-71702-7 2007. (inproceedings)
We introduce the versioned relational data model, which allows a user to apply conditional schema changes to a populated database without breaking applications compiled against an existing schema, and without loss of existing data. Our model is based on keeping a history of conditional schema changes, and converting tuples on demand to fit the correct schema in any schema version. We provide a concrete definition of schema versioning: The ability to specify an operator on any schema version, such that the tuples in the result are unaffected by schema versions created after the specified schema version. Finally, we show that our model supports schema versioning.
-
Jonas Allemann, Web Service Integration and Composition for Enabling Automatic Adaption of Heterogeneous WSDL Descriptions, December 2007. (diplomathesis)
Distributed heterogeneous process composition is becoming increasingly crucial and is used extensively in various kinds of applications such as web search engines, real-time systems, high performance computing, grid computing and distributed systems to provide more flexible service mapping and enable access to heterogeneous services. This work explains the basics of a Service Oriented Architecture approach for implementing distributed and heterogeneous business processes via Web Services, concentrating especially on Web Service Composition and Automated Web Service Composition, and then shows an example implementation of a travel business service based on BPEL4WS (Business Process Execution Language).