Application of Natural Language Processing and Information Retrieval in Two Software Engineering Tools

  • Author / Creator
    Campbell, Hazel V
  • Many software engineering problems have traditionally been approached by applying techniques based on static analysis and fixed sets of rules. I created two novel techniques to tackle three software engineering problems: typo location, fix suggestion, and crash report bucket creation. Unlike previous techniques, these are based on methods commonly used to handle natural-language artifacts.

    Existing tools and previous work typically try to be general and to work with any valid program or theoretically possible output. In contrast, this thesis builds upon prior work that successfully applied NLP models to code to improve code completion in an IDE (Integrated Development Environment). It continues in that vein and presents tools that focus on the code that programmers actually write and the crashes that actually occur.

    First, I applied natural-language models to locate errors in source code that cause the code to fail to compile or to raise an error when it runs. Language models can adapt to coding styles and idioms. My co-authors and I showed that a tool using an n-gram model of code that had previously compiled successfully could supplement the error locations reported by the Java compiler. Using our tool to suggest a location after each error message produced by the Java compiler resulted in an MRR (mean reciprocal rank) score 11-40% closer to a perfect score than the Java compiler's score alone. Then, my co-authors and I showed that a similar approach also worked with the Python interpreter, though it faced significantly more challenges. When combined with the Python interpreter's error messages, our approach correctly located an additional 9-23% of the tested typos, which were created by mutation. Next, my co-authors and I showed that the technique still worked in a more restricted offline setting. In addition, we showed that the approach could accurately suggest changes that repair around a third of typos made by students.
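
    The idea of locating a typo with an n-gram model can be illustrated with a minimal sketch. It is not the tool described in this thesis: the class `NGramModel`, the function `rank_suspicious_positions`, the trigram order, the add-one smoothing, and the toy token sequences are all assumptions chosen for illustration. The model is trained on token sequences from code assumed to be known-good, and the positions in a new sequence are then ranked by how surprising the model finds them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NGramModel:
    """Hypothetical n-gram model with add-one smoothing (an assumption)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = Counter()          # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-token contexts
        self.vocab = set()

    def _pad(self, tokens):
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def train(self, token_sequences):
        for tokens in token_sequences:
            padded = self._pad(tokens)
            self.vocab.update(padded)
            for gram in ngrams(padded, self.n):
                self.counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def surprisal(self, gram):
        """Bits of surprise for the last token of gram given its context."""
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        p = (self.counts[gram] + 1) / (self.context_counts[gram[:-1]] + v)
        return -math.log2(p)


def rank_suspicious_positions(model, tokens):
    """Return token indices ordered from most to least surprising.

    Index len(tokens) refers to the end-of-sequence marker.
    """
    padded = model._pad(tokens)
    scored = [(model.surprisal(gram), i)
              for i, gram in enumerate(ngrams(padded, model.n))]
    scored.sort(key=lambda pair: -pair[0])
    return [i for _, i in scored]


# Toy corpus standing in for code that is assumed to be known-good.
good_code = [
    ["x", "=", "foo", "(", "y", ")"],
    ["print", "(", "x", ")"],
]
model = NGramModel(n=3)
model.train(good_code)

# The final ")" has been mistyped as "(" at index 5.
broken = ["x", "=", "foo", "(", "y", "("]
ranking = rank_suspicious_positions(model, broken)
```

    In this toy case the mistyped token is ranked first. A ranked list like `ranking` is the kind of output that a measure such as MRR scores: the earlier the true error location appears in the list, the higher the score.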

    I also applied the TF-IDF representation and distance function to the task of bucketing (clustering) software crash reports. In all cases, performance (in terms of F1-score) matched or beat commonly used rule-based techniques. The TF-IDF-driven approach can adapt automatically to patterns in crash reports as they evolve. Additionally, several side benefits arose from using statistical techniques. Some errors in source code can be automatically repaired using a language model. Patterns in crash metadata can be extracted easily using a bag-of-words approach with a suitable tokenizer.
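
    A bag-of-words, TF-IDF approach to bucketing can be sketched as follows. This is only an illustration under stated assumptions, not the system evaluated in this thesis: the tokenizer regular expression, the smoothed IDF formula, cosine similarity as the comparison function, the greedy assignment rule, the 0.5 threshold, and the example reports are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(report):
    """Split a crash report into lowercase identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", report.lower())


def tfidf_vectors(reports):
    """Build one sparse TF-IDF vector (a dict) per report."""
    docs = [Counter(tokenize(r)) for r in reports]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency of each token
    # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1
    return [
        {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
         for t, tf in doc.items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bucket_reports(reports, threshold=0.5):
    """Greedily assign each report to the most similar existing bucket.

    A report starts a new bucket when no earlier report reaches the
    (hypothetical) similarity threshold.
    """
    vectors = tfidf_vectors(reports)
    buckets = []      # each bucket is a list of report indices
    assignment = []   # bucket id for each report, in input order
    for i, vec in enumerate(vectors):
        best_bucket, best_sim = None, threshold
        for b, members in enumerate(buckets):
            sim = max(cosine(vec, vectors[j]) for j in members)
            if sim >= best_sim:
                best_bucket, best_sim = b, sim
        if best_bucket is None:
            buckets.append([i])
            assignment.append(len(buckets) - 1)
        else:
            buckets[best_bucket].append(i)
            assignment.append(best_bucket)
    return assignment


# Made-up crash reports for illustration only.
reports = [
    "NullPointerException at com.app.Parser.parse at com.app.Main.run",
    "NullPointerException at com.app.Parser.parse at com.app.Main.start",
    "OutOfMemoryError at com.db.Cache.load at com.db.Pool.acquire",
]
assignment = bucket_reports(reports)
```

    Here the two reports that share most of their stack frames fall into the same bucket, while the unrelated report starts a new one. Because the weights are recomputed from the reports themselves, no hand-written rule about which frames matter is needed in this sketch.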

    This thesis’s results encourage research on approaches that apply on-line, off-the-shelf algorithms or models, initially developed for natural-language artifacts, to programming-language and other software artifacts. These results do not guarantee that such uses will be successful, but they do indicate that such approaches should, at least, be considered.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.