Bachelor/Master Theses and Master Project Topics

This page lists the open BSc. and MSc. thesis descriptions, as well as the master project opportunities currently available in the DDIS research group. Do not hesitate to contact the respective person if you are interested in one of the topics. If you would like to write a thesis about your own idea, you can propose it to the person whose work is most closely related to what you plan to do, or you can contact ddis-theses@ifi.uzh.ch directly.

Differential Privacy Synthetic Data Challenge (MSc. Project)

In 2006, Netflix launched a programming competition to improve its recommender system, with a prize of one million dollars. In this context, it released a dataset about its users and their activity. To preserve users' privacy, Netflix anonymized the data by removing personally identifiable attributes. However, two researchers from the University of Texas at Austin, Arvind Narayanan and Vitaly Shmatikov, showed that anonymizing the data was not enough: they were able to identify a subset of users by linking the Netflix data with external data sources such as IMDb.

In this context, where data analytics over the databases of large communities is central to the development of applications and services, it is easy to imagine how many data scientists may come into contact with our personal data. As a solution, differential privacy offers a middle layer to protect our data from them: it gives strong statistical guarantees that a data scientist can analyse the data without discovering whether the data of one specific user is or is not in the database. Differential privacy is used by companies like Apple, Google, and Uber to let their analysts analyse the data while avoiding individual malicious activities which may lead to new privacy leaks and scandals.
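
To make this guarantee concrete, here is a minimal Python sketch of the classic Laplace mechanism applied to a counting query (the function name, the toy data, and the epsilon value are our own and purely illustrative, not part of the project):

    import numpy as np

    def laplace_count(data, predicate, epsilon):
        # A counting query has sensitivity 1: adding or removing one user
        # changes the true count by at most 1. Laplace noise with scale
        # 1/epsilon therefore yields epsilon-differential privacy.
        true_count = sum(1 for row in data if predicate(row))
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Toy example: how many users rated more than 100 movies?
    users = [{"id": 1, "ratings": 140},
             {"id": 2, "ratings": 35},
             {"id": 3, "ratings": 210}]
    print(laplace_count(users, lambda u: u["ratings"] > 100, epsilon=0.5))

Whether any single user is in the dataset or not, the distribution of the noisy answer changes by at most a factor of e^epsilon, which is exactly the indistinguishability described above.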

This story shows that anonymization techniques alone do not guarantee privacy preservation when we want to publish a dataset. Over the last ten years, privacy research has developed more sophisticated techniques to overcome this issue. Among them, differential privacy has emerged, since it offers strong statistical guarantees on what attackers may learn from a dataset. Intuitively, the idea of differentially private synthetic data is to generate a dataset of new, fake users that preserves the statistical properties of the original dataset. In this way, data analysts can learn the features of the population, but they cannot learn anything about a specific user.
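
As a rough illustration of this idea, the sketch below implements the simplest differentially private synthetic-data recipe, a noisy histogram (all names and parameter values here are ours, for illustration only): build a histogram of the real data, perturb each bin with Laplace noise, and sample fake records from the noisy bins.

    import numpy as np

    def synthetic_from_histogram(values, bins, epsilon, n_synthetic):
        # One user affects exactly one bin by at most 1, so Laplace
        # noise of scale 1/epsilon per bin gives epsilon-DP.
        counts, edges = np.histogram(values, bins=bins)
        noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=len(counts))
        noisy = np.clip(noisy, 0, None)        # counts cannot be negative
        # Sample synthetic records from the noisy distribution:
        # pick a bin, then a uniform value inside that bin.
        probs = noisy / noisy.sum()
        chosen = np.random.choice(len(counts), size=n_synthetic, p=probs)
        return np.random.uniform(edges[chosen], edges[chosen + 1])

    real_ages = np.random.normal(40, 12, size=1000)    # stand-in "real" data
    fake_ages = synthetic_from_histogram(real_ages, bins=20,
                                         epsilon=1.0, n_synthetic=1000)

The fake_ages sample has roughly the same distribution as real_ages, yet none of its values belongs to a real user; challenge-grade algorithms refine this recipe to handle many correlated attributes at once.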

In this project, we want to participate in the NIST Topcoder Synthetic Data Challenge. We want to (1) explore differential privacy algorithms for generating synthetic datasets; (2) pick one of them and improve it; and (3) apply it to a real-world dataset. Eventually, the output of the project can be submitted to the contest and hopefully win a prize!

If you are interested, please contact: Narges Ashena


Visual and Interactive Demonstration of Differential Privacy (MSc. Project)

These days, it is rare to find a person whose personal data has not been collected and recorded by governments or companies. Recent privacy scandals have highlighted that protecting personal data is challenging, while the consequences for individuals' privacy can be very serious (e.g., identity theft and manipulation of political elections). Following the famous saying that "an ounce of prevention is worth a pound of cure", the very first step should be to educate citizens and web users.

In this context, where data analytics over the databases of large communities is central to the development of applications and services, it is easy to imagine how many data scientists may come into contact with our personal data. As a solution, differential privacy offers a middle layer to protect our data from them: it gives strong statistical guarantees that a data scientist can analyse the data without discovering whether the data of one specific user is or is not in the database. Differential privacy is used by companies like Apple, Google, and Uber to let their analysts analyse the data while avoiding individual malicious activities which may lead to new privacy leaks and scandals.

Practically speaking, differential privacy techniques work as black boxes whose only required parameter is a number called the privacy budget. Intuitively, the budget indicates how quickly we lose privacy while answering the data scientists' questions. The aim of this project is to build a visual and interactive application that raises awareness and fills the knowledge gap around privacy-related risks and solutions (specifically differential privacy) among non-experts, who are the most affected and the least informed. The team will demonstrate this software at the Scientifica event in Zurich in September 2019.
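
To illustrate what "losing privacy while answering questions" means, the toy accountant below (a hypothetical class of our own, not part of any real differential-privacy library) tracks a budget under sequential composition: each query consumes part of the total epsilon, and once the budget is spent, no further queries are answered.

    class BudgetAccountant:
        # Sequential composition: k queries that are each epsilon_i-DP
        # together consume sum(epsilon_i) of the total budget.
        def __init__(self, total_budget):
            self.remaining = total_budget

        def spend(self, epsilon):
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted: query refused")
            self.remaining -= epsilon
            # ...compute and return the noisy answer here...

    accountant = BudgetAccountant(total_budget=1.0)
    accountant.spend(0.4)   # ok, 0.6 left
    accountant.spend(0.4)   # ok, 0.2 left
    accountant.spend(0.4)   # raises: the budget is exhausted

Communicating this depletion intuitively, rather than as a raw number, is one of the visualisation challenges the project could tackle.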

If you are interested, please contact: Narges Ashena


Knowledge Graph Linking through Joint Embeddings (MSc. Thesis)

A variety of publicly available knowledge graphs exists. Some cover general knowledge (DBpedia, Wikidata), while others contain domain-specific knowledge, e.g. bibliometric data (DBLP, RDF Book Mashup, Colibrary), information related to biology (Bio2RDF, BioPortal), and many more. Typically, knowledge graphs are created by independent organizations, for a specific use case, and from a specific data source. This leads to a fragmentation of information, even though most knowledge graphs overlap with others. The challenge of Linked Data is to find ways to identify where two knowledge graphs overlap and to link them accordingly. This can be achieved mainly via two paradigms: similarities in node labels, and similarities in graph structure. The former paradigm has already been proven effective. The latter is currently under discussion, and we have recently developed an embedding-based method that could be modified for this purpose. The goal of this thesis is to test our method in the setting of knowledge graph alignment and to investigate how it can be combined with the label-oriented paradigm.
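
As a rough sketch of the structural paradigm (a generic baseline of our own, not the group's method), suppose the entities of both graphs have already been embedded into a shared vector space; alignment then reduces to nearest-neighbour search over the embeddings:

    import numpy as np

    def align_entities(emb_a, emb_b, names_a, names_b, threshold=0.9):
        # Normalise rows so the dot product equals cosine similarity.
        a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
        b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
        sims = a @ b.T                     # pairwise cosine similarities
        best = sims.argmax(axis=1)         # nearest neighbour in graph B
        return [(names_a[i], names_b[j], float(sims[i, j]))
                for i, j in enumerate(best) if sims[i, j] >= threshold]

One conceivable way to bring in the label-oriented paradigm would be to add a string-similarity term to sims before taking the argmax; studying how the two paradigms can be combined is part of the thesis.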

If you are interested, please contact: Matthias Baumgartner
