Bachelor Theses

This page lists the BSc thesis topics currently available in our group. Don't hesitate to contact the respective person if you are interested in one of the topics. If you would like to write a thesis about your own idea, you can propose it to the person whose work is most related to what you plan to do, or you can contact Prof. Bernstein directly.

Statistical Validity in Science

Many prescribed medical drugs depend on the effects of certain proteins. There are millions of ways organic molecules can be arranged in proteins, which is reflected in the wealth of different effects these proteins produce in our bodies. Many of these effects are discovered through academic research and published in research papers in scientific journals or presented at conferences, which means that the findings are evaluated in some way. Often, they are evaluated using statistics, or more specifically, Null-Hypothesis Significance Testing (NHST).

NHST has many drawbacks and is often hard to get right. For example, many of these tests (e.g., ANOVA, the t-test, and related procedures) rely on the underlying data fulfilling certain prerequisites. ANOVA, for example, requires the groups to have equal variances. In a forthcoming study, we found that these assumptions are rarely checked or reported (~13% for a major scientific conference). Given this, we now wonder what else may have gone wrong with the statistical validity of research papers. To this end, we plan to expand and use our internally developed system based on Scala/Play, although little code will likely need to be written. Instead, you will plan a larger experiment with us, involving several international professors. Upon a successful experimental design follows its execution, in which you (and your tool) will coordinate thousands of crowd workers towards assessing the statistical validity of research papers. Finally, we will analyse the results and try to draw inferences about some core fields of science. Such a finding might have a positive impact on many.
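For illustration, the sketch below computes Levene's test statistic, one common way to check the equal-variance assumption mentioned above before running an ANOVA. It is a minimal pure-Python sketch; the function name and example data are our own, not part of the system described here.

```python
# Minimal sketch: Levene's W statistic for checking ANOVA's
# equal-variance assumption. Larger W = stronger evidence that the
# group variances differ (compare W against an F distribution to test).
from statistics import mean

def levene_w(groups):
    """Levene's W (mean-centred variant) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each observation from its group mean.
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_bar_i = [mean(zi) for zi in z]                   # per-group mean deviation
    z_bar = sum(sum(zi) for zi in z) / n_total         # grand mean deviation
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((zij - zb) ** 2 for zi, zb in zip(z, z_bar_i) for zij in zi)
    return (n_total - k) / (k - 1) * between / within

# Two groups with visibly different spreads -> large W.
print(levene_w([[4.9, 5.0, 5.1], [1.0, 5.0, 9.0]]))
```

A paper that runs an ANOVA without reporting a check like this is exactly the kind of case the planned experiment is meant to surface.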

If you're interested, please contact

Last update: October 4th 2016

Increasing the number of open data streams on the Web

The open data movement is increasing the number of available datasets on the Web. However, additional effort is required to make the data more accessible. While methods and best practices for static data have been developed over the past years, little attention has been dedicated to dynamic and real-time data. One of the distinguishing features of this kind of data is that it can be served through push-based mechanisms, i.e., streaming APIs, in addition to the common pull-based mechanisms, i.e., Web APIs.

In this project, we want to increase the amount of real-time, streaming open data available on the Web. The main steps can be roughly summarised as follows:
1) identification of relevant datasets to be exposed through streaming APIs, e.g., datasets from here, here, or obtained from the city of Zurich (we are currently exploring such a use of datasets).
2) publication of the identified datasets through streaming APIs, by using and extending the TripleWave framework.
3) design and implementation of a proof of concept on top of these datasets to show a potential use of the API.
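To illustrate the push-based idea behind step 2, the sketch below replays a static dataset as a stream: each record is annotated with an emission timestamp and pushed to a subscriber callback as it "arrives". This is a generic sketch, not TripleWave's actual API; all names and the event layout are illustrative.

```python
import time

def replay(records, on_event, interval=0.0):
    """Turn a static dataset into a push-based stream: annotate each
    record with a timestamp and hand it to the subscriber callback."""
    for record in records:
        event = {"emittedAt": time.time(),  # when the event was produced
                 "payload": record}
        on_event(event)                     # push to the subscriber, no polling
        time.sleep(interval)                # pace the replay

# A subscriber registers a callback instead of repeatedly polling a Web API.
received = []
replay([{"sensor": "s1", "value": 21.5},
        {"sensor": "s2", "value": 19.8}],
       received.append)
print(len(received))  # → 2
```

In the pull-based model the client would instead request the current state on its own schedule; the point of a streaming API is that the server decides when new data goes out.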

If you are interested, please contact:

Publication date: 16.11.2016

SPARQL query evaluation in Big Data processors

RDF is a framework to describe resources and relations among them as graphs. The query language to express and compose queries over RDF models is SPARQL. The RDF Stream Processing initiative has started investigating how SPARQL can be extended to query RDF streams, i.e. flows of time-stamped RDF data. As a result, several engines to continuously evaluate queries over RDF streams have been designed and implemented, e.g. C-SPARQL and CQELS, but these solutions suffer from scalability and performance issues.
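As a toy illustration of what any SPARQL engine has to do, the sketch below matches a single basic graph pattern (the core building block of SPARQL queries) against a set of RDF-like triples. It is not a real SPARQL evaluator; the data and function name are made up for this example.

```python
def match_pattern(triples, pattern):
    """Return variable bindings for one triple pattern. Strings starting
    with '?' are variables; anything else must match the triple exactly."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if p in binding and binding[p] != t:
                    break               # same variable bound to two values
                binding[p] = t
            elif p != t:
                break                   # constant term does not match
        else:
            results.append(binding)
    return results

graph = [(":alice", ":knows", ":bob"),
         (":bob", ":knows", ":carol"),
         (":alice", ":age", "30")]

# Analogous to: SELECT ?x ?y WHERE { ?x :knows ?y }
print(match_pattern(graph, ("?x", ":knows", "?y")))
```

A full engine additionally joins the bindings of several such patterns and, in the streaming case, re-evaluates them continuously as new triples arrive, which is where the scalability issues mentioned above come in.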

In parallel, Big Data processors, e.g., Spark and Heron, have been released. These systems can process data streams and distribute the computation over several nodes, achieving performance not possible with pre-existing solutions.

The goal of this project is to investigate whether and to what extent Big Data processors cover the features offered by SPARQL. In the first part of the project, a use case will be designed by selecting a data stream and the operations to be performed on it. Next, the selected Big Data processors will be deployed, and the use-case operations will be implemented on top of them. The project will end with a qualitative and quantitative comparison of the results obtained with the different processors.
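A typical use-case operation of the kind to be compared, counting events over a sliding time window, can be sketched in plain Python as follows. This only illustrates the window semantics; no real engine is involved, and the data and names are illustrative.

```python
def window_counts(events, width, slide):
    """Count timestamped events per sliding window.
    `events` is a list of (timestamp, payload); windows start at time 0."""
    if not events:
        return []
    end = max(t for t, _ in events)
    counts, start = [], 0.0
    while start <= end:
        in_window = [e for t, e in events if start <= t < start + width]
        counts.append((start, len(in_window)))
        start += slide                  # slide the window forward
    return counts

stream = [(0.5, "a"), (1.2, "b"), (1.8, "c"), (3.1, "d")]
# A 2-second window sliding by 1 second, in the spirit of RANGE/STEP
# window clauses in RSP languages such as C-SPARQL.
print(window_counts(stream, width=2.0, slide=1.0))
# → [(0.0, 3), (1.0, 2), (2.0, 1), (3.0, 1)]
```

In the project, the same windowed operation would be expressed both in an RSP engine's query language and in a Big Data processor's streaming API, and the two implementations compared.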

If you are interested, please contact:

Publication date: 16.11.2016