Leveraging Natural Language Processing Methods to Evaluate Automatically Generated Cloze Questions: A Cautionary Tale

  • Author / Creator
    Gorgun, Guher
  • The purpose of this dissertation is to employ three prominent natural language processing (NLP) methods to assess the feasibility of automatically evaluating cloze questions generated by automatic item generation (AIG) methods. AIG methods have been developed to address the need for a large number of items for computerized assessments as well as online learning environments. Yet traditional methods for evaluating item quality are limited: they require reviewing each generated item individually to obtain information about its quality. In this study, we first provided an exhaustive overview of the item quality criteria and evaluation methods used by AIG researchers. This allowed us to portray current evaluation practices, their advantages, and their limitations. We proposed a taxonomy of the evaluation methods currently used in AIG studies, namely metric-based evaluations, human evaluators, and post-hoc evaluations. Given that current evaluation methods have several limitations and typically cannot be used to evaluate all generated items, we examined three NLP methods for evaluating item quality automatically. As such, this is a proof-of-concept study: we used automatically generated cloze questions rated by crowdsourced workers to investigate the utility of three prominent NLP methods for item evaluation. In doing so, we examined the capacity of incorporating NLP and machine learning (ML) methods into the item evaluation process for automatically generated items to render item evaluation more feasible. These methods included training three ML classifiers (i.e., random forest, support vector machine, and logistic regression) using linguistic features extracted from item stems and keyed responses (Study 1), fine-tuning a large language model, namely BERT (Study 2), and instruction-tuning a generative large language model, namely Llama-2 (Study 3); illustrative sketches of each approach follow the record below. In Study 1, the best-performing classifier was logistic regression, followed by random forest and support vector machine. Nonetheless, the results showed that the ML classifiers were quite limited in predicting item quality. In Study 2, we fine-tuned BERT-Large and BERT-Base and found an improvement in item quality prediction compared with the Study 1 results. In Study 3, the performance of the instruction-tuned Llama-2 model surpassed all other methods and achieved acceptable performance for identifying item quality. Overall, the findings suggest the promise of tuning generative large language models with specific instructions regarding item quality for automatically evaluating the quality of generated items.

  • Subjects / Keywords
  • Graduation date
    Fall 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-2esy-m151
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
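
The following is a minimal sketch of the Study 1 setup as described in the abstract: a logistic regression classifier predicting item quality from linguistic features. The feature set and labels here are randomly generated placeholders, not the dissertation's actual features or crowdsourced ratings.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)

    # Hypothetical linguistic features per item, e.g., stem length,
    # type-token ratio, keyed-response word frequency, readability.
    X = rng.normal(size=(500, 4))

    # Hypothetical binary quality labels (1 = acceptable, 0 = flawed),
    # standing in for the crowdsourced ratings used in the dissertation.
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    # Hold out a stratified test split and evaluate the classifier on it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

The same pipeline accepts a RandomForestClassifier or an SVC in place of LogisticRegression, which is how the three Study 1 classifiers can be compared on identical features.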
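A similarly minimal sketch of the Study 2 setup: fine-tuning BERT for binary item-quality classification with the Hugging Face transformers library. The checkpoint, toy items, and hyperparameters below are illustrative assumptions, not the dissertation's settings.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Hypothetical toy data: item stem plus keyed response, 0/1 quality label.
    data = Dataset.from_dict({
        "text": ["The capital of France is ___. (Paris)",
                 "Water boils at ___ degrees Celsius at sea level. (100)"],
        "label": [1, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # Tokenize the item text so BERT can consume it.
    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=64)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="bert-item-quality",
                             per_device_train_batch_size=2,
                             num_train_epochs=1)
    Trainer(model=model, args=args, train_dataset=data).train()

Swapping "bert-base-uncased" for "bert-large-uncased" reproduces the BERT-Base versus BERT-Large comparison the abstract mentions.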
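Finally, an illustrative instruction-style prompt for the Study 3 setup. The wording is a hypothetical example of how an instruction about item quality can be paired with an item and a target judgment when tuning a generative model such as Llama-2; it is not the dissertation's actual prompt.

    # Hypothetical instruction-tuning template; each training example pairs
    # an item-quality instruction with an item and its target judgment.
    PROMPT_TEMPLATE = """### Instruction:
    Judge whether the cloze question below is acceptable. A good item has one
    defensible keyed response, a clear stem, and no cues that give away the
    answer. Answer "good" or "bad".

    ### Input:
    {item_stem}
    Keyed response: {keyed_response}

    ### Response:
    {quality_label}"""

    print(PROMPT_TEMPLATE.format(
        item_stem="The chemical symbol for gold is ___.",
        keyed_response="Au",
        quality_label="good",
    ))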