Generating Smart Diffs Through Deep Learning


While plain-text diffs are a straightforward way of keeping track of changes in a software project, they are poorly suited to understanding those changes. Different semantic changes may be mixed together in a single diff, and diffs are difficult to process further with automated tools.

Approaches like ChangeDistiller extract changes between two revisions based on abstract syntax trees (ASTs) instead of plain text source code. This allows them to recognize semantic changes, like whether specific elements (if conditions, classes, methods, etc.) have been added, removed, modified or even moved to other locations in the source code.
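
To illustrate the idea of tree-based change extraction, the following is a minimal sketch that uses Python's standard `ast` module. It is not how ChangeDistiller works internally; it only shows why comparing syntax trees separates semantic changes from purely textual ones. The helper names `function_dumps` and `diff_functions` are hypothetical, and the sketch assumes function names are unique within a file (nested functions or methods with the same name would collide).

```python
import ast


def function_dumps(source: str) -> dict[str, str]:
    """Map each function name to a textual dump of its AST.

    ast.dump omits line and column information by default, so two
    functions with the same structure produce the same dump even if
    they were moved or surrounded by different comments.
    """
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def diff_functions(old_source: str, new_source: str) -> dict[str, list[str]]:
    """Report functions added, removed, or structurally modified."""
    old = function_dumps(old_source)
    new = function_dumps(new_source)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


old_revision = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new_revision = (
    "# a comment\n\n"
    "def b():\n    return 2\n\n"
    "def a():\n    return 3\n\n"
    "def c():\n    pass\n"
)
print(diff_functions(old_revision, new_revision))
# {'added': ['c'], 'removed': [], 'modified': ['a']}
```

A plain-text diff of these two revisions would flag almost every line, whereas the tree-based comparison reports only that `c` was added and `a` was modified; the moved but unchanged `b` and the new comment are ignored.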

However, this approach has two problems:

  • The change types identified by tools such as ChangeDistiller have been manually crafted by researchers and can seem arbitrary.
  • Tools like this need to be implemented separately for each programming language.

Since platforms such as GitHub contain the entire histories of millions of software projects, it should be possible to:

  1. Cluster different types of changes using unsupervised machine learning for any programming language.
  2. Generate a classifier that can recognize these change types instead of implementing one manually.
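
As a rough illustration of the first step, the sketch below represents each change as a vector of differences in AST node-type counts and groups the vectors with a small k-means loop. Everything here is an assumption made for the example: the feature representation, the choice of k-means, the fixed vocabulary, the use of Python's `ast` module as a stand-in for a language-independent parser, and the helper names `node_type_counts`, `change_vector` and `kmeans`. Finding a suitable learning approach is precisely the open question of the project.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in the source."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def change_vector(old_source: str, new_source: str, vocabulary: list[str]) -> list[int]:
    """Describe a change by how the count of each node type shifted."""
    old, new = node_type_counts(old_source), node_type_counts(new_source)
    return [new[name] - old[name] for name in vocabulary]


def kmeans(vectors: list[list[float]], k: int, iterations: int = 20) -> list[int]:
    """Plain k-means; the first k vectors serve as initial centroids
    so that the result is deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical vocabulary and toy changes, for illustration only.
vocabulary = ["If", "FunctionDef", "Return"]
changes = [
    ("x = 1\n", "x = 1\nif x:\n    x = 2\n"),
    ("def f():\n    return 1\n", "def f():\n    return 1\n\ndef g():\n    return 2\n"),
    ("y = 0\n", "y = 0\nif y:\n    y = 3\n"),
    ("def h():\n    return 0\n", "def h():\n    return 0\n\ndef k():\n    return 5\n"),
]
vectors = [change_vector(old, new, vocabulary) for old, new in changes]
print(kmeans(vectors, k=2))
# [0, 1, 0, 1]
```

In this toy input, the two changes that add an `if` statement end up in one cluster and the two that add a function in the other. A real study would need a far richer representation and a principled way of choosing the number of clusters.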

Goal of the Master Project

The outcome of this project should be a taxonomy of change types for different programming languages based on empirical evidence contained in open source repositories as well as the implementation of a change classifier generator.

Task Description

The main tasks of the project are:

  • Find a suitable unsupervised machine learning approach to identify common change types for different programming languages.
  • Extract thousands of projects from GitHub and apply the learning approach to cluster the change types.
  • Implement a generator that produces classifiers which recognize these change types for arbitrary programming languages.