Exploring Methods for Generating and Evaluating Skill Targeted Reading Comprehension Questions

von der Ohe, Spencer McIntosh

doi:10.7939/r3-r7ds-fa51

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

von der Ohe, Spencer McIntosh
It takes skilled teachers a significant amount of time and effort to create high
quality reading comprehension questions, often making it impractical to target
a particular reader’s weaknesses. Recently, language models have been
proposed as a tool to help teachers fill this gap, allowing these teachers to
generate questions targeting specific skill types.
In this thesis, we propose SoftSkillQG, a new soft-prompt based language
model for generating skill targeted reading comprehension questions that does
not require any manual effort to target new skills. We compare SoftSkillQG
against a variety of strong baselines and show that it outperforms existing
techniques on four out of five question quality metrics for the SBRCS dataset
and in human evaluation of Context Specificity on the QuAIL dataset. However,
on the QuAIL dataset, T5 WTA, a previously proposed method using
manually created prompts, outperforms SoftSkillQG in terms of perplexity
and these same five metrics.
We investigate why SoftSkillQG performs poorly relative to T5 WTA, a
method using manually created “hard” prompts, on the QuAIL dataset by
examining the effect of both dataset size and prompt initialization on SoftSkillQG’s performance.
We show that dataset size may be affecting performance, but augmenting
training with silver data from the SQuAD dataset did not improve
performance. On the other hand, initializing the prompt of SoftSkillQG using
the same prompt as T5 WTA yielded nearly the same perplexity on the QuAIL
dataset.
Finally, we perform a first-of-its-kind analysis using the human annotations
from our previous experiments to compare five different methods for evaluating
sets of generated questions. We find that MS-Jaccard4 best captures the
diversity of a set of questions; Best Reference Evaluation aligns most
closely with human judgement of Answerability; Cartesian Product evaluation
aligns most closely with Context-Specificity; and Fréchet BERT Distance
aligns most closely with Fluency.
Subjects / Keywords
Graduation date

Spring 2024
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-r7ds-fa51
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Fyshe, Alona (Computing Science)