Application of Natural Language Processing and Information Retrieval in Two Software Engineering Tools

Campbell, Hazel V

doi:10.7939/r3-1wte-mb81

Author / Creator

Campbell, Hazel V
Many software engineering problems have traditionally been approached by applying techniques based on static analysis and fixed sets of rules. I created two novel techniques to tackle three software engineering problems: typo location, fix suggestion, and crash report bucket creation. Unlike previous techniques based on static analysis or a fixed set of rules, these techniques are based on methods commonly used to handle natural language artifacts.

Existing tools and previous work typically try to be general and work with any valid program or theoretically possible output. In contrast, this thesis builds upon prior work that successfully applied NLP models to code to improve code completion in an IDE (Integrated Development Environment). This thesis continues in that vein and presents tools that focus on the code that programmers actually write and the crashes that actually occur.

First, I applied natural-language models to locate errors in source code that cause the code to fail to compile or to produce an error when the code runs. Language models can adapt to coding styles and idioms. My co-authors and I showed that a tool using an n-gram model of code that previously compiled successfully could supplement the error locations produced by the Java compiler. Using our tool to suggest a location after each error message produced by the Java compiler resulted in an MRR (mean reciprocal rank) score 11-40% closer to a perfect score than the Java compiler's score. Then, my co-authors and I showed that a similar approach also worked with the Python interpreter, though it faced significantly more challenges. When combined with the Python interpreter's error messages, our approach correctly located an additional 9-23% of tested typos made by mutation. Next, my co-authors and I showed that the technique still worked in a more restricted offline setting. In addition, we showed that the approach could also accurately suggest changes to repair around a third of typos made by students.
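
As an illustration only (this is not the thesis's implementation), the general idea of ranking likely error locations with an n-gram model trained on known-good code, and of scoring such rankings with MRR, might be sketched in Python as follows. The trigram order, the add-one smoothing, the pre-tokenized input, and all function names are assumptions made for this sketch.

```python
import math
from collections import Counter


def train_trigram_counts(token_sequences):
    """Count trigrams and their two-token contexts over known-good token streams."""
    trigrams, contexts = Counter(), Counter()
    for tokens in token_sequences:
        # Hypothetical padding symbols marking the start and end of a file.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts


def rank_suspicious_positions(tokens, trigrams, contexts, vocab_size):
    """Return token indices ordered from most to least surprising.

    An index equal to len(tokens) refers to the end-of-input marker.
    """
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    scored = []
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        # Add-one smoothing (an assumption of this sketch) keeps unseen
        # trigrams at a small nonzero probability.
        prob = (trigrams[context + (padded[i],)] + 1) / (contexts[context] + vocab_size)
        scored.append((-math.log2(prob), i - 2))
    # Stable sort: ties keep their original left-to-right order.
    scored.sort(key=lambda pair: -pair[0])
    return [index for _, index in scored]


def mean_reciprocal_rank(rankings, true_positions):
    """Average of 1/rank of the true error position; 0 when it is not ranked."""
    total = 0.0
    for ranking, truth in zip(rankings, true_positions):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(rankings)
```

In this sketch, a position whose token is improbable given the two preceding tokens is ranked near the top, and a perfect MRR of 1.0 means the true error position was ranked first for every test file.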

I also applied the TF-IDF representation and distance function to the task of bucketing (clustering) software crash reports. In all cases, performance (in terms of F1-score) matched or beat commonly used rule-based techniques. The TF-IDF-driven approach can adapt automatically to patterns in crash reports as they evolve. Additionally, several side benefits arose from using statistical techniques. Some errors in source code can be automatically repaired using a language model. Patterns in crash metadata can be extracted easily using a bag-of-words approach with a suitable tokenizer.
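
As a rough sketch of what TF-IDF-based bucketing can look like (again, not the thesis's implementation), the following Python code builds bag-of-words TF-IDF vectors for crash reports and greedily assigns each report to the most similar existing bucket. The tokenizer, the smoothed IDF formula, the cosine similarity measure, the 0.5 threshold, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(report):
    """Split a crash report into lowercase identifier-like tokens (assumed tokenizer)."""
    return re.findall(r"[a-z_][a-z0-9_]*", report.lower())


def tfidf_vectors(reports):
    """Build one sparse TF-IDF vector (a dict of token -> weight) per report."""
    token_lists = [tokenize(report) for report in reports]
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n = len(reports)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed IDF (an assumption): rarer tokens receive larger weights.
        vectors.append({
            tok: count * (math.log((1 + n) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bucket_reports(reports, threshold=0.5):
    """Greedily group reports; each bucket is a list of report indices."""
    vectors = tfidf_vectors(reports)
    buckets = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for bucket in buckets:
            # Similarity to a bucket is the similarity to its closest member.
            sim = max(cosine(vec, vectors[j]) for j in bucket)
            if sim >= best_sim:
                best, best_sim = bucket, sim
        if best is None:
            buckets.append([i])  # no bucket is similar enough: start a new one
        else:
            best.append(i)
    return buckets
```

Because the weights are recomputed from whatever reports are supplied, a scheme of this kind needs no hand-written rules about which stack frames or fields matter, which is one way to read the claim that the approach adapts as crash reports evolve.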

This thesis’s results encourage research on approaches that apply on-line, off-the-shelf algorithms or models, initially developed for natural-language artifacts, to programming-language and other software artifacts. However, these results do not guarantee that such uses will be successful; they do indicate that such uses should, at least, be considered.
Graduation date

Fall 2021
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-1wte-mb81
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Hindle, Abram (Computing Science)