Department of Informatics – DDIS

 

Master's Theses

This page lists the MSc thesis topics currently available in our group. Don't hesitate to contact the respective person if you are interested in one of the topics. If you would like to write a thesis about your own idea, you can propose it to the person whose work is most closely related to what you plan to do, or you can contact Prof. Bernstein directly.

Crowdsourced Feature Selection

Questions like the following can often be answered by designing efficient decision models based on classification: Who are the potential customers we want to approach with a given solution? How can fraud and/or abuse be detected in a given system? To what degree can patients be diagnosed and treated successfully given observable symptoms? Such models are elicited and studied in machine learning research and have led to numerous tangible results. However, not all areas that capture human knowledge enjoy the benefits of a clear and clean information representation. Consequently, the design of such models is a challenging task.

In this project you will explore decision scenarios where classification is based on the non-obvious, latent knowledge of multiple human agents within a crowdsourcing framework. In this scenario we seek creative ways to elicit and aggregate this hidden (human) knowledge so that we can intertwine it with existing machine learning (classification) methods.
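As a first intuition, consider the simplest possible aggregation scheme: workers cast binary votes on whether a feature is relevant, and a feature is kept when a majority judges it relevant. The sketch below is illustrative only (all names and the voting scheme are invented); the thesis would explore far richer elicitation and aggregation methods:

```scala
// Minimal sketch: majority-vote aggregation of crowd judgements about
// feature relevance. All names and the voting scheme are illustrative.
object CrowdFeatureSelection {

  // One worker's binary relevance votes, keyed by feature name.
  type Votes = Map[String, Boolean]

  /** Keep a feature if more than `threshold` of its voters judged it relevant. */
  def selectFeatures(workerVotes: Seq[Votes], threshold: Double = 0.5): Set[String] = {
    val allFeatures = workerVotes.flatMap(_.keys).toSet
    allFeatures.filter { feature =>
      val votes = workerVotes.flatMap(_.get(feature))
      votes.nonEmpty && votes.count(identity).toDouble / votes.size > threshold
    }
  }

  def main(args: Array[String]): Unit = {
    val votes = Seq(
      Map("age" -> true, "zip" -> false, "income" -> true),
      Map("age" -> true, "zip" -> true),
      Map("age" -> false, "zip" -> false, "income" -> true)
    )
    println(selectFeatures(votes)) // Set(age, income)
  }
}
```

The selected feature set could then be handed to any standard classifier; the interesting research question is how to weight voters of differing reliability instead of counting them equally.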

Posted: 27.06.2014

Contact: Michael Feldman

Visual Design Interface for Crowd Computing

What if you could orchestrate thousands of people at will by merely writing a couple of lines of code? If we look at people as a highly distributed system, with each person representing an unreliable node that can perform some kind of operation, the question arises of how to program such a system. We have developed CrowdLang, a high-level language with some elements borrowed from Business Process Modelling, to fill this gap.

We would like to explore ways to support developers writing CrowdLang code by implementing a visual process design interface in which people can drag & drop elements, connect them, and occasionally write some Scala code to steer the underlying DSL. The interface would be a solid web application built with state-of-the-art HTML5, JavaScript, and Scala together with the Play framework. In a complementary step, you would also develop a debugger that monitors the execution at each step and visualizes the collected data in the design interface.
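To give a flavour of what the visual interface could generate, here is a purely hypothetical sketch of an embedded Scala DSL in which visually connected elements become composed tasks. CrowdLang's actual operators and types differ; the `Task` combinator below is invented for illustration only:

```scala
// Purely hypothetical sketch: each drag & drop element maps to one
// combinator of an embedded Scala DSL. CrowdLang's real DSL differs.
object CrowdFlowSketch {

  // A "crowd task" is modelled here as a plain function; in reality it
  // would dispatch work to human workers and wait for their answers.
  case class Task[A, B](name: String, run: A => B) {
    def andThen[C](next: Task[B, C]): Task[A, C] =
      Task(s"$name -> ${next.name}", run andThen next.run)
  }

  def main(args: Array[String]): Unit = {
    val translate = Task[String, String]("translate", _.toUpperCase) // stand-in for a human step
    val review    = Task[String, String]("review", _.trim)
    val flow      = translate andThen review // drawn as two connected boxes
    println(flow.name)           // translate -> review
    println(flow.run(" hello ")) // HELLO
  }
}
```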

Posted: 16.06.2014

Contact: Patrick de Boer

Information Extraction from On-line Fora

Alzheimer’s is a fatal disease of the brain, the sixth-leading cause of death in the United States and the only cause of death among the top 10 that cannot be prevented, cured, or even slowed [1]. While Alzheimer’s has attracted significant investments from major pharmaceutical firms, it has also been a meager disease area for innovation, with most promising new drug candidates failing in phase III clinical trials [2].

Given the state of pharmaceutical treatment and the urgency of the disease, doctors and caregivers have started exploring and adopting alternative, non-pharmaceutical treatments to deal with the symptoms of the disease. Patients and their caregivers extensively exchange knowledge about such treatments in online forums (e.g. www.alzconnected.org, http://forum.alzheimers.org.uk/).

Contact: Abraham Bernstein

Workload Scheduling in Storm

Project for: 1-2 Students
Can be completed as: Bachelor/Master Thesis, Master Project
Posted on: July 4, 2014

Introduction

To make use of ever-increasing data volumes, we rely on massive compute clusters consisting of thousands of machines. Being able to work with compute clusters and to process large amounts of data has become a skill that is much sought after in business and research.

Companies such as Google, Yahoo, and Facebook would not be able to make use of the user data they collect without highly parallelized and distributed computer systems. One disadvantage of batch-based systems such as Hadoop MapReduce is that the time it takes to process the data limits the speed with which one can react to real-world events. This is why continuous distributed data processing has gained traction in recent years.

Storm is a distributed, fault-tolerant, real-time computation framework: it allows its users to write applications that return results in real time. One problem in distributed computing is to decide how to assign work to the various computing nodes in a compute cluster. This process is called workload scheduling.

Project Description

The goal of this project is to develop and evaluate a workload scheduler for the Storm platform based on research results of the DDIS research group. More specifically, the project entails the following tasks (a small sketch of the scheduling idea follows the list):

  1. Implementation of
    • a Storm scheduler that bases its scheduling strategy on runtime communication statistics, using graph partitioning algorithms
    • suitable graph partitioning algorithms
    • suitable topologies for the evaluation
    • evaluation scripts to evaluate the system
  2. Evaluation of the running Storm system in terms of
    • network load
    • scalability
    • cost of moving state between machines
  3. Possible Extensions:
    • Changing the way in which message IDs are created in the acking framework. This task would also require an evaluation of the system performance with (and without) this new mechanism.
    • Implementation (and evaluation) of multiple different partitioning algorithms
    • Augmenting the communication graph with node weights that reflect CPU load
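To illustrate the core idea behind a graph-partitioning-based scheduler, here is a minimal, self-contained Scala sketch. It deliberately does not use Storm's pluggable scheduler API; the greedy heuristic, the `Edge` type, and the statistics are invented placeholders, and a real implementation would use a proper partitioning algorithm (e.g. multilevel partitioning):

```scala
import scala.collection.mutable

// Self-contained sketch: greedily co-locate pairs of executors that
// exchange many tuples, so heavy edges of the communication graph
// stay within one machine and off the network.
object GreedySchedulerSketch {

  // Edge in the communication graph: observed tuples/s between two executors.
  case class Edge(src: Int, dst: Int, tupleRate: Long)

  /** Assign executors to machines, heaviest edges first. */
  def partition(numExecutors: Int, edges: Seq[Edge], numMachines: Int): Map[Int, Int] = {
    val capacity   = math.ceil(numExecutors.toDouble / numMachines).toInt
    val assignment = mutable.Map.empty[Int, Int]
    val load       = mutable.Map.empty[Int, Int].withDefaultValue(0)

    def place(executor: Int, machine: Int): Unit =
      if (!assignment.contains(executor) && load(machine) < capacity) {
        assignment(executor) = machine
        load(machine) += 1
      }

    for (e <- edges.sortBy(-_.tupleRate)) {
      // Prefer a machine one endpoint already lives on; otherwise the emptiest.
      val target = assignment.get(e.src)
        .orElse(assignment.get(e.dst))
        .getOrElse((0 until numMachines).minBy(load))
      place(e.src, target)
      place(e.dst, target)
    }
    // Any executor not covered by an edge goes to the least loaded machine.
    for (ex <- 0 until numExecutors if !assignment.contains(ex))
      place(ex, (0 until numMachines).minBy(load))
    assignment.toMap
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(0, 1, 1000), Edge(2, 3, 900), Edge(1, 2, 10), Edge(0, 3, 5))
    // Heavy pairs (0,1) and (2,3) each end up co-located on one machine.
    println(partition(numExecutors = 4, edges, numMachines = 2))
  }
}
```

The intuition: placing both endpoints of a heavy edge on the same machine removes that edge's traffic from the network, which is exactly what the evaluation above would measure as network load.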

Requirements

Language: The project can be completed in either English or German.

Necessary skills:

  • Good programming skills in Java
  • Good understanding of Linux and some Bash-fu ;-)
  • Experience in or interest in learning more about distributed systems (Storm/Hadoop/Zookeeper/Torque/Maui)
  • Knowledge of Python and/or Clojure is a plus

Contact: Lorenz Fischer

Finding the Needle in the Olympic Haystack

High-frequency data volumes, as they occur, for example, in Internet TV processing, cannot be processed offline. Matching complex events on such data streams requires a reversal of the model applied to querying static databases. One key element for guaranteeing response times of online systems lies in bounding the number of items currently being processed.

In this thesis your goal will be to develop strategies for limiting the number of items to be processed by means of probabilistic models. You will first investigate data from two European IPTV providers and then develop and implement different pruning strategies in the DDIS processor for TEF-SPARQL, a query language for matching complex events on streams recently developed at DDIS. You will finally evaluate your method on real-world data such as the TV data from the Olympic Games 2012.
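One plausible family of pruning strategies bounds the number of live partial matches and evicts the least promising ones. The sketch below is self-contained and illustrative; estimating a partial match's completion probability from the IPTV data is precisely what the thesis would investigate:

```scala
// Self-contained sketch: keep at most `capacity` partial matches alive;
// when the bound is exceeded, evict the one least likely to complete.
// The probability estimate (here a plain field) would come from a
// probabilistic model learned on the stream data.
object ProbabilisticPruningSketch {

  case class PartialMatch(id: Long, completionProbability: Double)

  class BoundedMatchStore(capacity: Int) {
    // Ordered so that the least likely match is dequeued first.
    private val store = scala.collection.mutable.PriorityQueue.empty[PartialMatch](
      Ordering.by((m: PartialMatch) => -m.completionProbability))

    /** Insert a partial match; if full, drop and return the least likely one. */
    def offer(m: PartialMatch): Option[PartialMatch] = {
      store.enqueue(m)
      if (store.size > capacity) Some(store.dequeue()) else None
    }

    def live: Seq[PartialMatch] = store.toSeq
  }

  def main(args: Array[String]): Unit = {
    val s = new BoundedMatchStore(capacity = 2)
    Seq(PartialMatch(1, 0.9), PartialMatch(2, 0.1), PartialMatch(3, 0.7))
      .foreach(m => s.offer(m).foreach(p => println(s"pruned match ${p.id}")))
    // prints: pruned match 2  (lowest completion probability)
  }
}
```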

Contact: Thomas Scharrenbach

SKOS2OWL: Help the Archivists

Most archives in the world use subject-heading systems and/or thesauri to express metadata for the items in the archive. With the increasing popularity of the Linked Open Data initiative, more and more archives want to hook into that cloud or simply want to use Linked Data as a means of easier access to their archives. Yet tools for creating Linked Data are based on the Web Ontology Language (OWL). The Simple Knowledge Organization System (SKOS), on the other hand, was developed to express subject-heading systems as Linked Data, but SKOS and OWL are not quite compatible.

In this thesis you will bridge the gap between OWL and SKOS. Based on experience from previous projects with the German public broadcasting services (ARD) and the Swiss Federal Statistical Office (FSO), you will adjust methods for ontology instance matching and/or ontology alignment to work with SKOS. You will implement your methods and evaluate them on real-world data, thereby lowering the hurdles for archives (or anyone using SKOS) to hook into the Linked Open Data cloud.
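A naive starting point for the conversion, sketched here with the Apache Jena 2.x API (the 2014-era com.hp.hpl.jena packages), maps skos:Concept to owl:Class and skos:broader to rdfs:subClassOf. This mapping is deliberately simplistic; the semantic mismatch it glosses over (skos:broader is not a subclass relation in general) is exactly where the thesis work begins:

```scala
import com.hp.hpl.jena.rdf.model.{Model, ModelFactory, RDFNode}
import com.hp.hpl.jena.vocabulary.{OWL, RDF, RDFS}

// Naive SKOS-to-OWL sketch: concepts become classes, broader becomes
// subClassOf. Real conversions must handle the semantic mismatch.
object Skos2OwlSketch {
  val SKOS = "http://www.w3.org/2004/02/skos/core#"

  def convert(skos: Model): Model = {
    val owl     = ModelFactory.createDefaultModel()
    val broader = skos.getProperty(SKOS + "broader")
    val concept = skos.getResource(SKOS + "Concept")

    // Every SKOS concept becomes an OWL class.
    val concepts = skos.listResourcesWithProperty(RDF.`type`, concept)
    while (concepts.hasNext) owl.add(concepts.next(), RDF.`type`, OWL.Class)

    // skos:broader statements become rdfs:subClassOf axioms.
    val anyNode: RDFNode = null // wildcard in Jena's listStatements API
    val stmts = skos.listStatements(null, broader, anyNode)
    while (stmts.hasNext) {
      val s = stmts.next()
      owl.add(s.getSubject, RDFS.subClassOf, s.getObject)
    }
    owl
  }
}
```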

Contact: Thomas Scharrenbach

Never Again Miss Your Favourite Live Show: Recommending the Olympic Games

Recommender systems are successfully employed in stores like Amazon.com. With the increasing availability of TV over the Internet (IPTV), recommendations may become a distinguishing feature for IPTV providers to attract current and potential future viewers. Current approaches to recommending TV content are based on descriptions that are already available, such as IMDb entries or Electronic Program Guides (EPG). But what about live shows, where such information is scarce?

Within the ViSTA-TV project, a consortium of partners from industry (Zattoo, the BBC, Rapid-I GmbH) and academia (UZH, VU Amsterdam, TU Dortmund) investigates how recommendations can be based on the actual content. In particular, we investigate how this can be accomplished for TV shows broadcast live, such as the Olympic Summer Games 2012.

In this thesis you will have the unique chance to work in an international project on real-world problems with real-world data. Your task will be to develop a real-time recommendation engine based on weighted rules. You will run and evaluate your system on the complete dataset of the Olympic Summer Games 2012. We count on your recommendations!
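The weighted-rule idea fits in a few lines: every rule that fires for a viewer/show pair contributes its weight, and the highest-scoring shows are recommended. In the illustrative sketch below, the rules, weights, and attributes are invented placeholders; learning suitable rules from the ViSTA-TV data would be part of the thesis:

```scala
// Self-contained sketch of recommendation by weighted rules. All rules,
// weights and attributes are invented placeholders.
object WeightedRuleRecommenderSketch {

  case class Viewer(favouriteGenres: Set[String], watchedLive: Boolean)
  case class Show(name: String, genre: String, isLive: Boolean)
  case class Rule(weight: Double, fires: (Viewer, Show) => Boolean)

  val rules = Seq(
    Rule(2.0, (v, s) => v.favouriteGenres.contains(s.genre)),
    Rule(1.0, (v, s) => s.isLive && v.watchedLive)
  )

  // Score = sum of the weights of all rules that fire for this pair.
  def score(v: Viewer, s: Show): Double =
    rules.collect { case r if r.fires(v, s) => r.weight }.sum

  def recommend(v: Viewer, candidates: Seq[Show], k: Int): Seq[Show] =
    candidates.sortBy(s => -score(v, s)).take(k)

  def main(args: Array[String]): Unit = {
    val viewer = Viewer(Set("sports"), watchedLive = true)
    val shows = Seq(
      Show("100m Final", "sports", isLive = true),
      Show("Evening News", "news", isLive = true),
      Show("Period Drama", "drama", isLive = false))
    println(recommend(viewer, shows, k = 2)) // live sports first (score 3.0)
  }
}
```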

Contact: Thomas Scharrenbach

Implementation of a Complex Event Processing Engine in Signal Collect

Many distributed applications require continuous and timely processing of information as it flows from producers to consumers. Examples include intrusion detection systems, which analyze network traffic in real time to identify possible attacks; environmental monitoring applications, which process raw data coming from sensor networks to identify critical situations; and applications performing online analysis of stock prices to identify trends and forecast future values.

Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the timeliness requirements of such domains. Therefore, so-called stream and complex event processing systems have been developed in recent years. A recent ACM article provides a good overview: http://dl.acm.org/citation.cfm?doid=2187671.2187677

The goal of this thesis is to build an interpreter for event processing in the Signal/Collect graph-computing framework, based on an ANTLR-based compiler for the DDIS Event Processing Query Language.

The programming model of Signal/Collect (http://signal-collect.googlecode.com/files/iswc2010-signalcollect.pdf) is expressive enough to concisely formulate many iterated and data-flow computations, while allowing the framework to parallelize and distribute the processing. The EP-SPARQL queries are compiled to a data-flow graph that allows for parallel and distributed query matching on large data streams.
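The following self-contained sketch illustrates the data-flow idea: each operator of a compiled query becomes a node that consumes events and emits derived events to its successors. It deliberately does not use Signal/Collect's actual vertex and edge API; all types here are invented for illustration:

```scala
// Self-contained sketch of a compiled query plan as a data-flow graph.
// In Signal/Collect, operators would be vertices and event forwarding
// would happen via signals along edges.
object DataFlowSketch {

  case class Event(stream: String, payload: String)

  abstract class Operator {
    var successors: List[Operator] = Nil
    def process(e: Event): Unit
    protected def emit(e: Event): Unit = successors.foreach(_.process(e))
  }

  // Filter operator: forwards events from one stream only.
  class Filter(stream: String) extends Operator {
    def process(e: Event): Unit = if (e.stream == stream) emit(e)
  }

  // Sequence operator: matches when stream `a` was seen before stream `b`.
  class Seq2(a: String, b: String) extends Operator {
    private var seenA = false
    def process(e: Event): Unit =
      if (e.stream == a) seenA = true
      else if (e.stream == b && seenA) emit(Event("matched", s"$a before $b"))
  }

  class Sink extends Operator {
    def process(e: Event): Unit = println(s"match: ${e.payload}")
  }

  def main(args: Array[String]): Unit = {
    val seq = new Seq2("login", "download")
    seq.successors = List(new Sink)
    Seq(Event("download", "-"), Event("login", "-"), Event("download", "-"))
      .foreach(seq.process)
    // prints one match: the download that follows the login
  }
}
```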

Implementations in this thesis would be in Java and/or Scala.

Contact: Jörg-Uwe Kietz & Philip Stutz