Leveraging Natural Language Processing Methods to Evaluate Automatically Generated Cloze Questions: A Cautionary Tale

  • Author / Creator
    Gorgun, Guher
  • The purpose of this dissertation is to employ three prominent natural language processing (NLP) methods to assess the feasibility of automatically evaluating cloze questions generated by automatic item generation (AIG) methods. AIG methods have been developed to address the need for a large number of items for computerized assessments as well as online learning environments. Yet traditional methods for evaluating item quality are limited: they require reviewing each generated item individually to obtain information about its quality. In this study, we first provided an exhaustive overview of the item quality criteria and evaluation methods used by AIG researchers. This allowed us to portray current evaluation practices, their advantages, and their limitations. We proposed a taxonomy of the evaluation methods currently used in AIG studies, namely metric-based evaluations, human evaluators, and post-hoc evaluations. Given that current evaluation methods have several limitations and typically cannot be used to evaluate all generated items, we examined three NLP methods for evaluating item quality automatically. As such, this is a proof-of-concept study: we used automatically generated cloze questions rated by crowdsourced workers to investigate the utility of three prominent NLP methods for item evaluation. In doing so, we examined the capacity of incorporating NLP and machine learning (ML) methods into the item evaluation process for automatically generated items to render item evaluation more feasible. These methods included training three ML classifiers (i.e., random forest, support vector machine, and logistic regression) using linguistic features extracted from item stems and keyed responses (Study 1), fine-tuning a large language model, namely BERT (Study 2), and instruction-tuning a generative large language model, namely Llama-2 (Study 3); illustrative sketches of each approach follow the record below. In Study 1, the best-performing classifier was logistic regression, followed by random forest and support vector machine. Nonetheless, the results showed that the ML classifiers were quite limited in predicting item quality. In Study 2, we fine-tuned BERT-Large and BERT-Base and found an improvement in item quality prediction compared with the Study 1 results. In Study 3, the performance of the instruction-tuned Llama-2 model surpassed all other methods and achieved acceptable performance for identifying item quality. Overall, the findings suggest the promise of tuning generative large language models with specific instructions regarding item quality for automatically evaluating the quality of generated items.

  • Subjects / Keywords
  • Graduation date
    Fall 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-2esy-m151
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
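
The following is a minimal sketch of the Study 1 setup as described in the abstract: a logistic regression classifier predicting item quality from linguistic features. The feature set and labels here are randomly generated placeholders, not the dissertation's actual features or crowdsourced ratings.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)

    # Hypothetical linguistic features per item, e.g., stem length,
    # type-token ratio, keyed-response word frequency, readability.
    X = rng.normal(size=(500, 4))

    # Hypothetical binary quality labels (1 = acceptable, 0 = flawed),
    # standing in for the crowdsourced ratings used in the dissertation.
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    # Hold out a stratified test split and evaluate the classifier on it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

The same pipeline accepts a RandomForestClassifier or an SVC in place of LogisticRegression, which is how the three Study 1 classifiers can be compared on identical features.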
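A similarly minimal sketch of the Study 2 setup: fine-tuning BERT for binary item-quality classification with the Hugging Face transformers library. The checkpoint, toy items, and hyperparameters below are illustrative assumptions, not the dissertation's settings.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Hypothetical toy data: item stem plus keyed response, 0/1 quality label.
    data = Dataset.from_dict({
        "text": ["The capital of France is ___. (Paris)",
                 "Water boils at ___ degrees Celsius at sea level. (100)"],
        "label": [1, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # Tokenize the item text so BERT can consume it.
    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=64)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="bert-item-quality",
                             per_device_train_batch_size=2,
                             num_train_epochs=1)
    Trainer(model=model, args=args, train_dataset=data).train()

Swapping "bert-base-uncased" for "bert-large-uncased" reproduces the BERT-Base versus BERT-Large comparison the abstract mentions.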
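Finally, an illustrative instruction-style prompt for the Study 3 setup. The wording is a hypothetical example of how an instruction about item quality can be paired with an item and a target judgment when tuning a generative model such as Llama-2; it is not the dissertation's actual prompt.

    # Hypothetical instruction-tuning template; each training example pairs
    # an item-quality instruction with an item and its target judgment.
    PROMPT_TEMPLATE = """### Instruction:
    Judge whether the cloze question below is acceptable. A good item has one
    defensible keyed response, a clear stem, and no cues that give away the
    answer. Answer "good" or "bad".

    ### Input:
    {item_stem}
    Keyed response: {keyed_response}

    ### Response:
    {quality_label}"""

    print(PROMPT_TEMPLATE.format(
        item_stem="The chemical symbol for gold is ___.",
        keyed_response="Au",
        quality_label="good",
    ))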