Data Mining with Eclipse

Status

Finished by Julio Gonnet.

Final thesis (PDF, 1538 KB)

Abstract

In recent years, there has been great interest in the field of data mining. Around the world, large companies have invested vast sums of money in enormous data warehouses and powerful data mining facilities, hoping to extract new information and thereby gain an economic advantage over their competitors. With today’s fast-growing technology, interoperability and the trend towards just-in-time systems, it is becoming increasingly likely that one will use or depend on data that does not yet exist or that does not belong to oneself. Furthermore, from a software engineering point of view, direct access to an application’s database is not recommended, because of the dependencies on and coupling to the application that it entails. Ultimately, we will want to do much more than just mine a set of data from our local database. Be it more powerful pre-processing of data, integration with other business applications or the automatic creation of a report for management, we cannot avoid integrating data mining solutions if we are to solve more complex problems. In our specific case, we are especially interested in the analysis of software evolution and require a data mining framework that integrates seamlessly with an integrated development environment (IDE) such as Eclipse, which already offers a large variety of components that produce software-related data.

In this thesis, we present the design and development of a data mining framework that integrates arbitrary data sources, existing data mining facilities and potential data consumers. In the first two chapters, we give a brief introduction to the world of data mining, explain the need for integration and outline the framework’s requirements. The tool’s functionality is presented as a guided tour of the framework, followed by an in-depth technical look at its main components. We then discuss the various highlights and problems encountered, present a simple proof of concept and close with our conclusions and an outlook on the framework’s future development.