Department of Informatics – DDIS

 

Master Theses

This page lists the MSc thesis topics currently available in our group. Do not hesitate to contact the respective person if you are interested in one of the topics. If you would like to write a thesis about an idea of your own, you can propose it to the person whose work is most closely related to your plan, or you can contact Prof. Bernstein directly.

Methods for Synchronous Human Computation / Crowdsourcing

Today, the Internet enables us to tackle complex problem-solving processes by interweaving both humans and machines. While the community has a lot of experience with asynchronous problems (e.g., OCR tasks such as in reCAPTCHA), it has only little experience with real-time, synchronous methods and with the design of rapid-refinement algorithms and pruning techniques.

In this master's thesis, you will design and evaluate methods for interweaving synchronous human and machine agents in a real-time speech-to-text transcription, text translation, spell-checking, or news-ticker task. This includes designing the algorithm / pruning method, implementing a prototype of it in CrowdLang, and evaluating the method in a field experiment.

Contact: Patrick Minder

Behavior-Based Quality Assurance in Crowdsourcing Markets

A big problem in crowdsourcing markets is the design of robust quality-assurance mechanisms. Today, the lack of such methods is compensated for by massive redundancy in task allocation and/or the use of voting mechanisms. Recent research has shown that behavioral data (e.g., working time per task, mouse movements) can be a good predictor of the resulting quality.

In this bachelor's / master's thesis, you will design several worker-behavior-based quality predictors and evaluate them in various domains.
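As a starting point, such a predictor could combine a few behavioral signals into a single quality estimate. The Python sketch below is purely illustrative: the features, weights, and the logistic combination are assumptions for demonstration, not a trained model.

```python
import math

def quality_score(working_time_s, mouse_distance_px, idle_ratio):
    """Combine behavioral signals into a [0, 1] quality estimate
    via a logistic function over a hand-picked linear score."""
    # Very short working times and long idle stretches suggest
    # spamming; some mouse interaction is expected on honest work.
    z = (0.05 * working_time_s          # longer engagement helps
         + 0.001 * mouse_distance_px    # interaction helps
         - 4.0 * idle_ratio             # idling hurts
         - 1.0)                         # bias term
    return 1.0 / (1.0 + math.exp(-z))

# A worker who spent 60 s, moved the mouse a lot, and rarely idled
# scores higher than one who rushed through the task in 3 s.
careful = quality_score(60, 2500, 0.1)
rushed = quality_score(3, 50, 0.9)
```

In the thesis, the hand-picked weights would of course be replaced by parameters learned from labeled task outcomes.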

Contact: Patrick Minder

Information Extraction from On-line Fora

Alzheimer’s is a fatal disease of the brain, the sixth-leading cause of death in the United States and the only cause of death among the top 10 that cannot be prevented, cured, or even slowed [1]. While Alzheimer’s has attracted significant investment from major pharmaceutical firms, it has also been a meager disease area for innovation, with most promising new drug candidates failing phase III clinical trials [2].

Given the state of pharmaceutical treatment and the urgency of the disease, doctors and caregivers have started exploring and adopting alternative, non-pharmaceutical treatments to deal with the symptoms of the disease. Patients and their caregivers extensively exchange knowledge about such treatments in online forums (e.g., www.alzconnected.org, http://forum.alzheimers.org.uk/).

Contact: Abraham Bernstein

Finding the Needle in the Olympic Haystack

High-frequency data volumes, as they occur, for example, in Internet TV processing, cannot be processed offline. Matching complex events on such data streams requires inverting the model applied to querying static databases. One key element for guaranteeing response times of online systems lies in bounding the number of items currently being processed.

In this thesis, your goal will be to develop strategies for limiting the number of items to be processed by means of probabilistic models. You will first investigate data from two European IPTV providers and then develop and implement different pruning strategies in the DDIS processor for TEF-SPARQL, a query language for matching complex events on streams, recently developed at DDIS. Finally, you will evaluate your method on real-world data such as the TV data from the 2012 Olympic Games.
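To illustrate the kind of pruning strategy meant here, the following Python sketch bounds the number of partially matched events kept in memory by an assumed per-item completion probability. The scoring field and the capacity policy are hypothetical illustrations, not part of TEF-SPARQL.

```python
import heapq

def prune(partial_matches, capacity):
    """Keep at most `capacity` partial matches, ranked by their
    estimated probability of completing into a full match."""
    if len(partial_matches) <= capacity:
        return partial_matches
    # nlargest keeps the `capacity` most promising candidates and
    # thereby bounds the working set, at the risk of missing matches.
    return heapq.nlargest(capacity, partial_matches,
                          key=lambda m: m["p_complete"])

stream_state = [
    {"id": 1, "p_complete": 0.9},
    {"id": 2, "p_complete": 0.2},
    {"id": 3, "p_complete": 0.6},
]
kept = prune(stream_state, capacity=2)  # ids 1 and 3 survive
```

The interesting research question is how to estimate `p_complete` from a probabilistic model of the stream, and how the chosen capacity trades recall against guaranteed response times.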

Contact: Thomas Scharrenbach

SKOS2OWL: Help the Archivists

Most archives in the world use subject-heading systems and/or thesauri to express metadata for items in the archive. With the increasing popularity of the Linked Open Data initiative, more and more archives want to hook into that cloud, or simply want to use Linked Data as a means of easier access to their archives. Yet, tools for creating Linked Data are based on the Web Ontology Language OWL. On the other hand, the Simple Knowledge Organization System (SKOS) was developed to express subject-heading systems as Linked Data, but SKOS and OWL are not quite compatible.

In this thesis, you will bridge the gap between OWL and SKOS. Based on experience from previous projects with the German public broadcasting services (ARD) and the Swiss Federal Statistical Office (FSO), you will adapt methods for Ontology Instance Matching and/or Ontology Alignment to work with SKOS. You will implement your methods and evaluate them on real-world data, and hence lower the hurdles for archives (or anyone using SKOS) to hook into the Linked Open Data cloud.
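To give a flavor of the gap to be bridged: one common, knowingly imperfect heuristic maps skos:Concept to owl:Class and skos:broader to rdfs:subClassOf. The toy Python sketch below works on plain triple tuples; a real implementation would use an RDF library and handle the cases where this heuristic is unsound (broader does not always denote subsumption).

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def skos_to_owl(triples):
    """Rewrite a SKOS triple list into an OWL-flavored one using the
    Concept->Class and broader->subClassOf heuristic."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == SKOS + "Concept":
            out.append((s, p, OWL + "Class"))
        elif p == SKOS + "broader":
            out.append((s, RDFS + "subClassOf", o))
        else:
            out.append((s, p, o))  # everything else passes through
    return out

thesaurus = [
    ("ex:Dog", RDF_TYPE, SKOS + "Concept"),
    ("ex:Dog", SKOS + "broader", "ex:Animal"),
]
ontology = skos_to_owl(thesaurus)
```

The thesis would go well beyond this naive rewriting, deciding per vocabulary (e.g., via instance matching or alignment evidence) when such a mapping is actually justified.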

Contact: Thomas Scharrenbach

Never Again Miss Your Favourite Live Show: Recommending the Olympic Games

Recommender systems are successfully employed in stores like Amazon.com. With the increasing availability of TV over the Internet (IPTV), recommendations may become a distinguishing feature for IPTV providers to attract current and potential future viewers. Current approaches to recommending TV content are based on descriptions already available, such as IMDb entries or Electronic Program Guides (EPG). But what about live shows, where such information is scarce?

Within the ViSTA-TV project, a consortium of partners from industry (Zattoo, the BBC, Rapid-I GmbH) and academia (UZH, VU Amsterdam, TU Dortmund) investigates how recommendations can be based on actual content. In particular, we investigate how this can be accomplished for TV shows broadcast live, such as the Olympic Summer Games 2012.

In this thesis, you will have the unique chance of working in an international project on real-world problems with real-world data. Your task will be to develop a real-time recommendation engine based on weighted rules. You will run and evaluate your system on the complete dataset of the Olympic Summer Games 2012. We count on your recommendations!
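As a minimal sketch of what a weighted-rule recommender could look like: each rule fires on the viewer's context and votes for a channel with a weight, and the channel with the highest accumulated weight wins. The rules, weights, and context fields below are invented for illustration; they are not part of the ViSTA-TV design.

```python
def recommend(context, rules):
    """Each rule is (condition, channel, weight); sum the weights of
    all firing rules per channel and return the best channel."""
    scores = {}
    for condition, channel, weight in rules:
        if condition(context):
            scores[channel] = scores.get(channel, 0.0) + weight
    return max(scores, key=scores.get) if scores else None

rules = [
    (lambda c: "sports" in c["interests"], "olympics-live", 0.8),
    (lambda c: c["hour"] >= 20, "evening-news", 0.5),
    (lambda c: "sports" in c["interests"] and c["hour"] >= 20,
     "olympics-live", 0.4),
]

# A sports fan watching at 21:00 gets the live Olympics feed
# (0.8 + 0.4 = 1.2 beats the 0.5 of the news rule).
pick = recommend({"interests": {"sports"}, "hour": 21}, rules)
```

In a real-time setting the interesting part is updating the rule weights incrementally as viewing behavior streams in, rather than scoring a static rule set.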

Contact: Thomas Scharrenbach

Fuzzy String Matching over Data Streams

While the amount of data and the rate at which it is being produced increase constantly, the physical limits of computer systems force us to distribute computation across multiple processors and machines. Because concurrency and distributed computing are hard to get right, frameworks such as Apache Hadoop and Storm (github) have been developed. Scientists, data analysts, and developers around the world use tools like these to process and gain knowledge from the ever-increasing amounts of data (Big Data).

A trick that both of these tools use is data partitioning (grouping in Storm lingo): by dividing up the work and processing data in parallel, the amount of data that can be processed is virtually unlimited. However, this only works if the problem can be divided up. In fuzzy string matching this poses a problem, as it is not immediately clear how to partition the data. One possible solution is to hash all input values into buckets of "similar" items and then only compare the items within each bucket. This can be done using a technique called Locality-Sensitive Hashing (LSH).
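The bucketing idea can be sketched in a few lines of Python. The shingle size, hash seeds, and band width below are arbitrary illustrative choices, not part of any Storm API: strings with similar character 3-gram sets tend to agree on their MinHash signature prefix and would be routed to the same task for detailed comparison.

```python
import hashlib

def shingles(s, k=3):
    """Character k-grams of a (lower-cased) string."""
    s = s.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash(s, seeds=(1, 2, 3, 4)):
    """MinHash signature: per seed, keep the smallest hash over
    all shingles. Similar shingle sets yield similar signatures."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
            for g in shingles(s))
        for seed in seeds)

def bucket_key(s, band=2):
    # Use only the first `band` signature values as the grouping key,
    # so near-duplicates have a good chance of colliding.
    return minhash(s)[:band]
```

Identical strings always share a bucket, and near-duplicates usually do; the false-negative rate is tuned via the number of seeds and the band width, which is exactly the trade-off a Storm grouping strategy would have to balance.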

In this thesis, your goal is to implement and evaluate a grouping strategy for Storm using LSH. Preferably, you have experience programming in Java and/or Python, you like writing efficient code (read: you know your algorithms & data structures), and you are not afraid of working with Linux.

Contact: Lorenz Fischer

Implementation of a Complex Event Processing Engine in Signal Collect

Many distributed applications require continuous and timely processing of information as it flows from producers to consumers. Examples include intrusion detection systems, which analyze network traffic in real time to identify possible attacks; environmental monitoring applications, which process raw data coming from sensor networks to identify critical situations; and applications performing online analysis of stock prices to identify trends and forecast future values.

Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the timeliness requirements of such domains. Therefore, so-called Stream and Complex Event Processing systems have been developed in recent years. A recent ACM article provides a good overview: http://dl.acm.org/citation.cfm?doid=2187671.2187677

The goal of this thesis is to build an interpreter for event processing in the Signal/Collect graph-computing framework, based on an ANTLR-based compiler for the DDIS Event Processing Query Language.

The programming model of Signal/Collect (http://signal-collect.googlecode.com/files/iswc2010-signalcollect.pdf) is expressive enough to concisely formulate many iterative and data-flow computations, while allowing the framework to parallelize and distribute the processing. The EP-SPARQL queries get compiled to a data-flow graph that allows for parallel and distributed query matching on large data streams.
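As a rough intuition for the programming model, here is a toy, single-threaded Python rendition of the signal/collect loop, shown for single-source shortest paths (the introductory example of the model). The real framework runs the two phases in parallel and distributed over a graph of vertex objects; this sketch only mimics the phase structure.

```python
def signal_collect(edges, source, rounds=10):
    """Edges are (u, v, weight). Vertices repeatedly *signal* their
    state along outgoing edges and *collect* incoming signals."""
    vertices = {v for e in edges for v in e[:2]}
    state = {v: (0 if v == source else float("inf")) for v in vertices}
    for _ in range(rounds):
        # Signal phase: each edge forwards source-state + weight.
        signals = {}
        for u, v, w in edges:
            signals.setdefault(v, []).append(state[u] + w)
        # Collect phase: each vertex keeps the minimum signal seen.
        new_state = dict(state)
        for v, incoming in signals.items():
            new_state[v] = min(state[v], *incoming)
        if new_state == state:  # converged: scores stopped changing
            break
        state = new_state
    return state

dist = signal_collect([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)], "a")
# dist == {"a": 0, "b": 1, "c": 3}
```

An event-processing interpreter would reuse the same skeleton: operator vertices collect partial matches from their predecessors and signal newly derived (partial) matches downstream.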

The implementation in this thesis would be in Java and/or Scala.

Contact: Jörg-Uwe Kietz & Philip Stutz