Exploring Methods for Generating and Evaluating Skill Targeted Reading Comprehension Questions

  • Author / Creator
    von der Ohe, Spencer McIntosh
  • It takes skilled teachers a significant amount of time and effort to create
    high-quality reading comprehension questions, often making it impractical to
    target a particular reader’s weaknesses. Recently, language models have been
    proposed as a tool to help teachers fill this gap, allowing them to
    generate questions targeting specific skill types.
    In this thesis, we propose SoftSkillQG, a new soft-prompt-based language
    model for generating skill-targeted reading comprehension questions that does
    not require any manual effort to target new skills. We compare SoftSkillQG
    against a variety of strong baselines and show that it outperforms existing
    techniques on four out of five question quality metrics for the SBRCS dataset
    and on human evaluation of Context Specificity on the QuAIL dataset. However,
    on the QuAIL dataset, T5 WTA, a previously proposed method using
    manually created prompts, outperforms SoftSkillQG in terms of perplexity
    and these same five metrics.
    We investigate why SoftSkillQG performs poorly relative to T5 WTA, a
    method using manually created “hard” prompts, on the QuAIL dataset by
    examining the effects of both dataset size and prompt initialization on
    SoftSkillQG’s performance.
    We show that dataset size may be affecting performance, but augmenting
    training with silver data from the SQuAD dataset did not improve
    performance. On the other hand, initializing the prompt of SoftSkillQG using
    the same prompt as T5 WTA yielded nearly the same perplexity on the QuAIL
    dataset.
    Finally, we perform a first-of-its-kind analysis using the human annotations
    from our previous experiments to compare five different methods for evaluating
    sets of generated questions. We find that MS-Jaccard4 best captures the
    diversity of a set of questions; Best Reference Evaluation aligns most
    closely with human judgement of Answerability; Cartesian Product evaluation
    aligns most closely with Context Specificity; and Fréchet BERT Distance
    aligns most closely with Fluency.

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-r7ds-fa51
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.