This page lists the MSc thesis topics currently available in our group. Don't hesitate to contact the respective person if you are interested in one of the topics. If you would like to write a thesis about your own idea, you can propose it to the person whose work is most closely related to what you plan to do, or you can contact Prof. Bernstein directly.

Statistical Validity in Science

Many prescription drugs depend on the effects of certain proteins. There are millions of ways organic molecules can be arranged in proteins, which is reflected in the wealth of different effects these proteins produce in our bodies. Many of these effects are discovered through academic research and published in scientific journals or presented at conferences, which entails that the findings are evaluated in some way. Often they are evaluated using statistics or, more specifically, null hypothesis significance testing (NHST).

NHST has many drawbacks and is often hard to get right. For example, many of these tests (e.g. ANOVA, the t-test and friends) rely on the underlying data fulfilling certain prerequisites. ANOVA, for example, requires the groups being compared to have equal variances. In a forthcoming study, we found that these assumptions are rarely checked or reported (~13% for a major scientific conference). Seeing this, we now wonder what else may have gone wrong with the statistical validity of research papers. To this end, we plan to extend and use our internally developed system based on Scala/Play, although little code will likely need to be written. Instead, you'll plan a larger experiment with us, involving a couple of distinguished international professors. Once the experiment has been designed, you will run it: you (and your tool) will coordinate thousands of crowd workers in assessing the statistical validity of research papers. Finally, we'll analyse our results and try to draw inferences about some core fields of science. Such a finding might have a positive impact on many.
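
To make the equal-variance prerequisite concrete, here is a minimal Python sketch of a rough screen one could run before an ANOVA. It is not part of our system and it is not a formal test (such as Levene's test): it only compares the largest and smallest sample variance across groups. The function names, the threshold of 4 and the sample data are assumptions chosen for illustration.

```python
from statistics import variance


def variance_ratio(groups):
    """Return the largest sample variance divided by the smallest one."""
    variances = [variance(g) for g in groups]
    smallest, largest = min(variances), max(variances)
    if smallest == 0:
        # All groups constant: treat as equal. Otherwise the ratio is unbounded.
        return 1.0 if largest == 0 else float("inf")
    return largest / smallest


def roughly_equal_variances(groups, max_ratio=4.0):
    """Rule-of-thumb screen; max_ratio=4.0 is an assumed, illustrative threshold."""
    return variance_ratio(groups) <= max_ratio


# Hypothetical measurements for three experimental conditions.
control = [4.1, 5.0, 4.8, 5.3, 4.6]
treatment_a = [5.9, 6.4, 6.1, 5.7, 6.3]
treatment_b = [2.0, 9.5, 5.1, 12.3, 0.4]

if not roughly_equal_variances([control, treatment_a, treatment_b]):
    print("Variances differ strongly; a plain ANOVA may not be appropriate.")
```

A paper that reports such a check (or a formal test of the same assumption) would count as having verified the prerequisite; the study mentioned above suggests that this is rarely done.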

If you're interested, please contact

Last Update: October 4th 2016

Collaborative Data Analysis

Crowdsourcing has raised interest in both the scientific and industrial communities as a collaboration model that enables people to join forces to solve otherwise challenging problems. Whereas the effectiveness of crowdsourcing for tedious and aggregative tasks is widely acknowledged, it is not yet well understood how to crowdsource complex and ill-defined tasks such as data analysis.

This thesis will investigate collaborative data analysis scenarios that involve diverse people with varying (and possibly limited) relevant knowledge. The goal of this thesis is to develop and evaluate a novel approach to supporting collaborative data analysis, and contribute to a better understanding of data analysis as a collaborative and distributed process accessible to a wide range of people with a diverse set of skills.

Your work will involve the analytical aspects of designing processes that allow non-experts to collaborate with data scientists on exciting questions by taking advantage of the available data. You will focus either on advancing our existing web-based prototype or on conducting a qualitative study to better understand the requirements for such a platform. Further, you will have a chance to evaluate your ideas in a real-world setting by recruiting freelancers and data scientists. If you are interested in extending your knowledge in the domain of data analysis and excited about the opportunities of democratizing data-driven research, drop me a line.

Posted: 1.9.2016

Contact: Michael Feldman

Collaborative Feature Engineering for Data Science Projects

With the increasing availability of (big) data, industry's need for people who are capable of analysing data is on the rise. Indeed, the number of people with Data Scientist on their LinkedIn profile has doubled in the last four years. Predictive modelling is one of the regular tasks a Data Scientist will take on: building a classifier on features of training data to predict a certain attribute on unseen data.

But how are features engineered on the training set? A simple approach could be to declare all attributes (columns) of the data as "features" and let the classifier do the rest. On Kaggle, a popular data science platform that hosts regular competitions for predictive modelling, one doesn't get very far with this simple approach. Winning teams often explain that a critical step to winning a competition is Feature Engineering: refining, merging and splitting existing features and adding new ones to the data set. Basically, the team that wins the prize in a Kaggle competition (between $10'000 and $100'000) is the team that did the best job in Feature Engineering (and ensembling, which is out of our scope). Feature Engineering is also important in industry: the higher a classifier's accuracy, the more useful it is.
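
As a small illustration of what splitting, merging and adding features can look like, here is a minimal Python sketch. The record layout (`signup`, `total_spend`, `visits`) and the derived feature names are hypothetical and not taken from any particular Kaggle competition.

```python
from datetime import datetime


def engineer_features(row):
    """Derive new features from one raw record given as a dict."""
    signup = datetime.strptime(row["signup"], "%Y-%m-%d %H:%M")
    features = dict(row)

    # Splitting: break the raw timestamp into parts a classifier can use.
    features["signup_hour"] = signup.hour
    features["signup_weekday"] = signup.weekday()  # Monday is 0

    # Adding: a new boolean feature derived from the weekday.
    features["is_weekend"] = signup.weekday() >= 5

    # Merging: combine two raw columns into one ratio, guarding against zero visits.
    visits = row["visits"]
    features["spend_per_visit"] = row["total_spend"] / visits if visits else 0.0

    # The raw timestamp string is dropped once it has been split.
    del features["signup"]
    return features


# Hypothetical raw record.
raw = {"signup": "2016-06-25 14:30", "total_spend": 120.0, "visits": 4}
print(engineer_features(raw))
```

Feeding only the raw columns to a classifier corresponds to the simple approach described above; the derived columns are the kind of additions a team would typically iterate on together.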

Crowdsourcing may be a way to build better features together rather than alone. For your Master's thesis, you will build a web platform on which data scientists can collaborate on feature engineering. We will then compare your platform to the current industry standard for Feature Engineering (IPython Notebook). If your platform yields better results than the industry standard, you will be responsible for a significant cost reduction in data science projects, combined with a lower entry barrier for companies to analyse their data. This may pave the way for wider adoption of data science in Switzerland and the economic advantages that come with it. :)

In the course of your thesis, you will learn about Feature Engineering, build a platform using the technology of your choice and play a lot with data. Interested? Drop me a message.

Posted: 29.06.2016

Contact: Patrick de Boer

Information Extraction from On-line Fora

Alzheimer’s is a fatal disease of the brain, the sixth-leading cause of death in the United States and the only cause of death among the top 10 that cannot be prevented, cured or even slowed [1]. While Alzheimer’s has attracted significant investments from major pharmaceutical firms, it has also been a meager disease area for innovation, with most promising new drug candidates failing in phase III clinical trials [2].

Given the state of pharmaceutical treatment and the urgency of the disease, doctors and caregivers have started exploring and adopting alternative, non-pharmaceutical treatments to deal with its symptoms. Patients and their caregivers extensively exchange knowledge about such treatments in online forums (e.g.,

Contact: Abraham Bernstein