ASDS'13 Research Talks

Monday 11.3 14:00 - 15:30, Mining and Analysis

Proactive Detection of Risky Software Changes

In this talk, we will present our work, which uses data stored in software repositories to proactively flag risky changes, i.e., changes that may break or cause errors in the software system, so that defects can be avoided before they are widely integrated into the code. The talk will discuss the results of a year-long study involving more than 450 developers across more than 60 teams, conducted to better understand and identify risky changes. We find that attributes such as the number of lines of code added and the history of the files being modified by the change can be used to accurately identify risky changes with a recall of more than 67% and a precision that is 37-87% higher than a baseline model. Our risk models are being used today by an industrial partner to manage the risk of their software projects.

An N-gram Analysis on the complete corpus of MSR papers

Over the last decade, the MSR community has experienced a big influx of researchers bringing in new ideas, state-of-the-art technology and contemporary research methods, yet it is unclear what the future might bring. Therefore, it is a worthwhile exercise to meditate on the past, present and future of the community. In this paper, we report on a text mining analysis applied to the complete corpus of MSR papers to reflect on where we come from, where we are now, and where we should be going. We address issues like the trendy (and outdated) research topics; the frequently (and less frequently) cited cases; the popular (and emerging) mining infrastructure; and finally the proclaimed actionable information that we are supposed to uncover.

On the Naturalness of Software

Programming languages, like their "natural" counterparts, are rich, powerful and expressive. But while skilled writers like Shakespeare, Schiller, and Rushdie delight us with their elegant, creative deployment of the power and beauty of natural language, most of what we regular mortals say and write every day is Very Repetitive and Highly Predictable. This predictability, as most of us have learned by now, is at the heart of the modern statistical revolution in speech recognition, natural language translation, question-answering, etc. We will argue that, despite the power and expressiveness of programming languages, most software is in fact also quite repetitive and predictable, and can be fruitfully modeled using the same types of statistical models used in natural language processing. We present some practical applications of this rather unexpected finding, and present a research vision arguing that this phenomenon is potentially rich in both scientific questions and engineering promise.
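
As a rough illustration of the kind of statistical model the abstract alludes to, the sketch below trains an add-one smoothed bigram model on a stream of code tokens and measures cross-entropy in bits per token. It is not the authors' implementation; the whitespace tokenization, the smoothing scheme and all function names are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams in a token sequence."""
    contexts = Counter(tokens[:-1])              # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent token pairs
    vocab_size = len(set(tokens)) + 1            # +1 reserves mass for unseen tokens
    return contexts, bigrams, vocab_size

def cross_entropy(tokens, model):
    """Average bits per token under an add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    total_bits = 0.0
    for prev, cur in pairs:
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total_bits -= math.log2(prob)
    return total_bits / len(pairs)

# Hypothetical training corpus: one loop header repeated many times.
loop_header = "for ( i = 0 ; i < n ; i ++ )".split()
model = train_bigram(loop_header * 20)

familiar = cross_entropy(loop_header, model)
scrambled = cross_entropy(list(reversed(loop_header)), model)
print(f"familiar: {familiar:.2f} bits/token, scrambled: {scrambled:.2f} bits/token")
```

Lower cross-entropy means the model finds the token sequence more predictable; the familiar loop header should score lower than the same tokens in reversed order.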

Categorizing Bugs with Social Networks: A Case Study on Four Open Source Software Communities

Efficient bug triaging procedures are an important precondition for successful collaborative software engineering projects. Triaging bugs can become a laborious task, particularly in open source software (OSS) projects with a large base of comparatively inexperienced part-time contributors. In this talk, an efficient and practical method will be presented that makes it possible to identify valid bug reports which a) refer to an actual software bug, b) are not duplicates and c) contain enough information to be processed right away. Our classification is based on nine measures to quantify the social embeddedness of bug reporters in the collaboration network. We demonstrate its applicability in a case study, using a comprehensive data set of more than 700,000 bug reports obtained from the BUGZILLA installation of four major OSS communities, for a period of more than ten years. For those projects that exhibit the lowest fraction of valid bug reports, we find that the bug reporters’ position in the collaboration network is a strong indicator for the quality of bug reports. Based on this finding, we develop an automated classification scheme that can easily be integrated into bug tracking platforms and analyze its performance in the considered OSS communities. Our study highlights the potential of using quantitative measures of social organization in collaborative software engineering. It also opens a broad perspective for the integration of social network analysis in the design of software developer support infrastructures.

Green Mining: What Can We Tell Developers

All evidence points to the fact that developers do not know much about the power consumption of their own software. With the advent of mobile devices being programmed much like general purpose computers, the non-functional requirement of energy efficiency is becoming far more important. In this presentation I will discuss the various stakeholders, their relation to software-oriented power consumption, and their fundamental responsibilities. Furthermore, I will discuss what we as researchers do to help developers with this difficult issue!

The Stat! environment for data science

Like programmers thirty years ago, today's data scientists (including empirical SE researchers) accomplish their work by awkwardly coordinating multiple independent tools. These tools include database management systems (MySQL), spreadsheets, scripting environments (Python), statistical programs (R, MATLAB) and machine learning tools (Weka). I'll present a demo of Stat!, a cloud-hosted "IDE" for data scientists. The goal of Stat! is to allow a data scientist to accomplish an entire workflow, from raw data to final presentations, in one environment. This integration creates the opportunity for high productivity, automated checking, and preservation of data provenance. The project's long-term goal is to democratize data analysis so that, say, the average spreadsheet user can use statistics and machine learning to draw valid conclusions about a data set of her choice.

Wednesday 13.3 10:30 - 12:00, Productivity and Facets of Mining

Reducing notification delays: What will happen if I change my code?

When developers change their code, they want to quickly know the effects of the change. How much can we reduce the delay before they know? Can we tell them right after they make the change? Can we tell them even before they make the change? We can! I'll show you how.

On Release Engineering and Developer Productivity

Software release engineering is the discipline of integrating, building, testing, packaging and delivering high-quality software releases to the end user. Whereas software used to be released in shrink-wrapped form once per year, modern companies like Intuit, Google and Mozilla only need a couple of days or weeks in between releases, while lean start-ups like IMVU release up to 50 times per day! Shortening the release cycle of a software project requires considerable process and productivity changes, yet the scope and nature of such changes are unknown to the majority of practitioners. In order to understand the challenges organizations face in shortening their projects' release cycle, we will touch on the major sub-fields of release engineering (integration, build and delivery) and their interaction with other activities like development and testing. This will allow us to explore the various release engineering stakeholders' interests and open challenges, as well as how mining software repositories can help to resolve these challenges.

What is Productivity Anyway?

Many software engineering researchers claim their techniques will improve a software engineer's productivity. I've certainly been guilty of making this claim. But what is productivity anyway? Is it a concept best left undefined? Is it a concept best left unmeasured? Is it a concept best left in the 20th century? This talk will briefly explore individual and organizational notions of productivity and will seek to raise more questions than it answers.

Towards Improving Statistical Modelling of Software Engineering Data: Think Locally, Act Globally!

Much research energy in software engineering is focused on the creation of effort and defect models. Such models are important means for practitioners to judge their current project situation, optimize the allocation of their resources, and make informed future decisions. However, software engineering data contains a large amount of variability. In earlier work, we presented a comparison of three approaches for creating statistical models for software defects and development effort. Global models are trained on the whole dataset. In contrast, local models are trained on subsets of the dataset. Last, we built a global model that takes into account local characteristics of the data. We evaluated the performance of these three modelling approaches on several datasets and found that local modelling leads to a considerably improved fit to the data compared to global models. We also demonstrated substantial improvements of local modelling over global modelling with respect to prediction performance. Such improvements are due to learning from homogeneous local subregions of the data, and not due to over-fitting. In this presentation, I will talk about our recent advances in this area of research. In particular we examine the following questions: (a) What is the role of clustering of datasets when building local models? (b) What is the impact of the choice of clustering algorithm and its parameters on the performance of the resulting local models? and (c) For the same clustering method, modelling techniques, and predicted outcome, how do different software engineering metrics respond to local modelling?
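
The difference between global and local modelling can be sketched with a toy example. The code below is only an illustration under simplifying assumptions: a single predictor, ordinary least squares, and cluster labels that are given by hand rather than produced by a clustering algorithm. The data and all names are hypothetical. It compares in-sample fit only; showing that the gain is not over-fitting, as the abstract claims, would require evaluation on held-out data.

```python
def fit_line(points):
    """Ordinary least squares fit of y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def squared_error(points, model):
    """Sum of squared residuals of a fitted line on the given points."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

def global_vs_local(clusters):
    """clusters maps a cluster label to its (x, y) points."""
    all_points = [p for pts in clusters.values() for p in pts]
    global_error = squared_error(all_points, fit_line(all_points))
    local_error = sum(squared_error(pts, fit_line(pts))
                      for pts in clusters.values())
    return global_error, local_error

# Hypothetical data: two subsets in which the metric relates to the
# outcome in opposite directions.
clusters = {
    "subset_a": [(x, 2.0 * x) for x in range(1, 6)],
    "subset_b": [(x, 20.0 - x) for x in range(1, 6)],
}
global_error, local_error = global_vs_local(clusters)
print(f"global SSE: {global_error:.2f}, local SSE: {local_error:.2f}")
```

A single line cannot follow both subsets at once, so the global error is large while each local model fits its own subset closely.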

Hongyu Zhang, Tsinghua University:

Towards Effective Management of Bugs

Software quality is important for the success of software projects. Although a range of measures have been taken to assure software quality, in reality released software systems still contain bugs. For a large and popular software system, the project team could receive a large number of bug reports. In this talk, I will introduce some of my recent work on effective management of software bugs. These methods are developed using mining software repositories (MSR) techniques, and can be applied to improve the software maintenance process.

Daniel German, University of Victoria:

Mining Distributed Version Control Repositories

In 2009 we discussed the promises and perils of mining distributed version control repositories (DVCs). I will discuss how, by mining more than the core repository, we can gain a better view of how DVCs are used, how we have overcome some of the perils, but at the same time, how some of the promises of DVCs become perils.

Thursday 14.3 10:30 - 12:00, Centric-based Approaches

Diomidis Spinellis, Athens University of Economics and Business:

sgsh: Scatter-gather operations on large data sets and streams

The scatter-gather shell, sgsh, allows the expressive construction of sophisticated data processing pipelines. The processing elements come from the large set of powerful Unix filters. The pipelines can efficiently process large software repositories, parallelize high-latency operations, and analyze data streams. Application examples include calculating C code metrics, resolving IP addresses in parallel, and analyzing web server log streams.
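
The scatter-gather idea itself can be sketched independently of the shell. The Python example below is not sgsh syntax and says nothing about how sgsh is implemented; it only illustrates sending one input to several independent processing steps concurrently and collecting their results. The three toy C code metrics and all names are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def count_lines(lines):
    """Number of source lines."""
    return len(lines)

def longest_line(lines):
    """Length of the longest source line."""
    return max((len(line) for line in lines), default=0)

def count_branches(lines):
    """Crude branch count: lines that contain 'if ('."""
    return sum(1 for line in lines if "if (" in line)

# Hypothetical set of processing steps, keyed by the name of their result.
METRICS = {
    "lines": count_lines,
    "longest": longest_line,
    "branches": count_branches,
}

def scatter_gather(lines, workers):
    """Scatter the same input to every worker, then gather results by name."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {name: pool.submit(fn, lines) for name, fn in workers.items()}
        return {name: future.result() for name, future in futures.items()}

source = [
    "int main(void) {",
    "    if (x) {",
    "        return 1;",
    "    }",
    "    return 0;",
    "}",
]
result = scatter_gather(source, METRICS)
print(result)
```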

An Asset-Centric Approach for Engineering Adaptive Security

Security is primarily concerned with protecting assets from harm. Identifying and evaluating assets are therefore key activities in any security engineering process – from modeling threats and attacks, discovering existing vulnerabilities, to selecting appropriate security controls. However, despite their crucial role, assets are often neglected during the development of secure software systems. Indeed, many systems are designed with fixed security boundaries and assumptions, without the possibility to adapt when assets change unexpectedly, new threats arise, or undiscovered vulnerabilities are revealed. To handle such changes, systems must be capable of dynamically enabling different security controls. In this talk, assets are promoted as first-class entities in engineering secure software systems. An asset model is related to requirements, expressed through a goal model, and the objectives of an attacker, expressed through a threat model. These models are then used as input to build a causal network to analyze system security in different situations, and to enable, when necessary, a set of security controls to mitigate security threats. The three models and the causal network are used to configure the activities of a MAPE (Monitoring, Analysis, Planning, and Execution) loop. These activities are performed at runtime to detect changes in assets and other relevant security concerns (Monitoring), re-estimate the security risk and the utility of all configurations of security controls (Analysis), select the configuration of security controls with the best utility (Planning), and apply it on the system (Execution). The approach is illustrated through a simple example from access control systems.
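
The four MAPE activities described above can be sketched as one pass of a control loop. This is a minimal illustration, not the approach presented in the talk: the causal network is replaced by a toy utility formula (negative residual risk minus the cost of the controls), and the configurations, numbers and names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlConfig:
    name: str
    cost: float        # hypothetical cost of enabling these controls
    mitigation: float  # hypothetical fraction of risk removed, between 0 and 1

def monitor(asset_values):
    """Monitoring: observe the total value of the assets currently at stake."""
    return sum(asset_values.values())

def analyze(asset_value, threat_likelihood, configs):
    """Analysis: estimate the utility of every configuration of controls."""
    utilities = {}
    for config in configs:
        residual_risk = asset_value * threat_likelihood * (1 - config.mitigation)
        utilities[config] = -(residual_risk + config.cost)
    return utilities

def plan(utilities):
    """Planning: select the configuration with the best utility."""
    return max(utilities, key=utilities.get)

def execute(config, system):
    """Execution: apply the selected configuration to the system."""
    system["active_controls"] = config.name
    return system

def mape_step(asset_values, threat_likelihood, configs, system):
    """Run Monitoring, Analysis, Planning and Execution once."""
    asset_value = monitor(asset_values)
    utilities = analyze(asset_value, threat_likelihood, configs)
    return execute(plan(utilities), system)

CONFIGS = [
    ControlConfig("no_extra_controls", cost=0.0, mitigation=0.0),
    ControlConfig("strict_access_control", cost=50.0, mitigation=0.9),
]

system = {}
mape_step({"visitor_badge": 100.0}, 0.1, CONFIGS, system)
low_value_choice = system["active_controls"]

# A more valuable asset appears, so the next pass re-evaluates the controls.
mape_step({"visitor_badge": 100.0, "server_room": 9900.0}, 0.1, CONFIGS, system)
high_value_choice = system["active_controls"]
print(low_value_choice, "->", high_value_choice)
```

When only a low-value asset is present, the cost of the strict controls outweighs the risk they remove; once a high-value asset appears, the same loop selects them.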

Towards Collaboration-centric Pattern-based Software Development Support

Software engineering activities tend to be loosely coupled to allow for flexibly reacting to unforeseen development complexity, requirements changes, and progress delays. This flexibility comes at the price of hidden dependencies among design and code artifacts that make it difficult or even impossible to assess change impact. Incorrect change propagation subsequently results in costly errors. This position paper proposes a novel approach based on monitoring engineering activities for subsequent high-level pattern detection. Patterns of (i) collaboration structures, (ii) temporal action sequences, and (iii) artifact consistency constraints serve as input to recommendation and automatic reconfiguration algorithms for ultimately avoiding and correcting artifact inconsistencies.

Liberating Software Engineers from the tyranny of a strict modeling language

Despite technological advances, most modeling tools for software engineering restrict users to work in a particular way. They either only support a predefined modeling language, or they force users to perform work steps in a specified order. Such tools impede the creativity of software engineers during early design phases of software projects. Therefore engineers often prefer to start with paper and pencil or a whiteboard to quickly draw their ideas about a software system and its requirements. This allows them to choose any modeling notation and level of detail they think works best for expressing their thoughts. As a consequence, engineers are left with model sketches that cannot be understood and processed by software tools. In this talk we will present our FlexiSketch approach. In this ongoing research, the idea is to have a software tool for mobile devices that not only allows users to sketch whenever and whatever they want, but also allows them to define their own modeling language by annotating the sketches at any time during the modeling process. This approach is supposed to free software engineers from restrictions of particular modeling languages, while the resulting model sketches – in contrast to their counterparts on whiteboards and paper – are amenable to a step-wise, semi-automatic formalization and beautification process. We will give an outline of our approach, followed by a short demonstration of our current tool prototype.

Improving Developer Productivity via Summarization

Developers spend large amounts of time searching and navigating the source code and other sources of information (e.g., external documentation, developers’ e-mails, bug repositories, etc.). Some of this information is structured and some of it is not. In some cases, a glance at a document is enough for developers to decide whether it is relevant for them or not. In other cases, developers need to read more before being able to decide on relevancy. We argue that document summarization techniques would help developers in making such decisions faster and better. Due to the heterogeneous nature of software documents, summarizing them is a challenging task. In addition, different developer tasks may require different types of summaries. In this talk we highlight the challenges of summarizing software artifacts.

Augmenting Collaboration of Heterogeneous Teams in an Interaction Room

Many (even agile) software process models provide only an organizational framework for a project's management, but do not provide support for focusing team discussions on those aspects of a system that are critical for the project's success (in terms of averting risks and creating value). Especially in heterogeneous teams with members from various business and technical backgrounds, however, insufficient understanding of the relationships between business processes and technology components is one of the main sources of project trouble. In the so-called Interaction Room, we therefore strive to make complex software projects more tangible by letting teams work with a pragmatic combination of model sketches and annotations to foster understanding of the system and its business domain, to reveal risks and uncertainties, and to track development progress from early on. The talk will report on experiences from using the Interaction Room in industry, and explore opportunities for augmenting the room with displays and sensors to facilitate more intuitive visualization, navigation and manipulation of complex project structures.