Replication Package for: "A Search-based Training Algorithm for Cost-aware Defect Prediction"

This page provides a download containing all data used in the study and the scripts used to compute the results. README.txt describes the contents in detail and explains how to use the scripts.


Annibale Panichella¹, Carol V. Alexandru², Sebastiano Panichella², Alberto Bacchelli¹, Harald C. Gall²

¹) Delft University of Technology, The Netherlands
²) University of Zurich, Switzerland


Research has yielded approaches to predict future defects in software artifacts based on historical information, thus assisting companies in effectively allocating limited development resources and developers in reviewing each other's code changes. Developers are unlikely to devote the same effort to inspecting each software artifact predicted to contain defects, since the effort varies with each artifact's size (cost) and the number of defects it exhibits (effectiveness). We propose to use Genetic Algorithms (GAs) for training prediction models to maximize their cost-effectiveness. We evaluate the approach on two well-known models, Regression Tree and Generalized Linear Model, and predict defects between multiple releases of six open source projects. Our results show that regression models trained by GAs significantly outperform their traditional counterparts, improving the cost-effectiveness by up to 240%. Often the top 10% of predicted lines of code contain up to twice as many defects.
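
To make the notion of cost-effectiveness concrete, the following minimal Python sketch computes the share of all defects found when artifacts are inspected in descending order of predicted score until a fixed fraction of the total lines of code has been spent. It is an illustration only, not the fitness function or the scripts shipped in this package: the function name defects_in_top_loc, its parameters, and the sample numbers are hypothetical.

```python
from typing import Sequence


def defects_in_top_loc(
    scores: Sequence[float],
    defects: Sequence[int],
    loc: Sequence[int],
    budget: float = 0.10,
) -> float:
    """Fraction of all defects found within `budget` of the total LOC.

    Hypothetical helper for illustration; not taken from the package.
    Artifacts are inspected in descending order of predicted score, and
    inspection stops at the first artifact that no longer fits the budget.
    """
    if not (len(scores) == len(defects) == len(loc)):
        raise ValueError("scores, defects and loc must have the same length")

    total_loc = sum(loc)
    total_defects = sum(defects)
    if total_loc == 0 or total_defects == 0:
        return 0.0

    # Highest predicted score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    remaining = budget * total_loc  # inspection cost still available, in LOC
    found = 0
    for i in order:
        if loc[i] > remaining:
            break
        remaining -= loc[i]
        found += defects[i]
    return found / total_defects


# Made-up data: four artifacts, 1000 LOC and 6 defects in total.
loc = [40, 400, 40, 520]
defects = [3, 1, 2, 0]

# A model whose scores track the defect counts ranks the small,
# defect-dense artifacts first.
good = defects_in_top_loc([3.0, 1.0, 2.0, 0.0], defects, loc)

# A model whose scores track size ranks the largest artifact first,
# which alone exceeds the 10% budget.
poor = defects_in_top_loc([40.0, 400.0, 40.0, 520.0], defects, loc)
```

A search-based trainer in the spirit of the abstract would use a measure of this kind as the objective to maximize when choosing model coefficients, rather than minimizing prediction error.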