Exploring Surprisal from Various Language Models for Predicting English Reading Times of People with Different Language Backgrounds

  • Author / Creator
    Clark, Shannon R
  • Abstract
    Surprisal estimated by language models is predictive of reading time in first-language (L1) reading. Research is emerging to determine whether this observation extends to reading in a second language (L2). Current attempts to characterize differences in the predictive power of surprisal for L1 and L2 reading times lack exploration of the reader's language background. As such, this thesis aims to evaluate the performance of surprisals derived from various language models for predicting English reading times of people with different L1s. To this end, we trained nine language models that varied in the extent of syntactic information, lexical information, and preceding context they used to compute surprisal. Next, we developed generalized additive mixed models to predict the English reading times of L1 speakers of English, Chinese, Korean, and Spanish using surprisal. Our results showed several commonalities. First, the best-performing surprisal for all language backgrounds was derived from a standard n-gram model or an n-gram model with added part-of-speech tags. Second, the lexical portion of total surprisal from a probabilistic context-free grammar performed more poorly than the syntactic portion. Last, out of the surprisals estimated using only syntactic information, those that accounted for the hierarchical structure of sentences outperformed the one based purely on sequential representations. Apart from these similarities, we observed differences by language background. It appears that surprisal computed using richer context performed better for L1 speakers of left-branching languages. It also seems that surprisals derived using hierarchical syntactic information performed better for languages with a word order different from English. Further research is needed to fully characterize these differences in performance in terms of the linguistic features of the reader's L1 and the way each language model computes surprisal. Our work shows that a variety of language models produce surprisals predictive of L1 and L2 reading times in English. Since the performance of these surprisals varied by the reader's L1, our work suggests that it is important to consider language background when using language models in the study of L2 reading.
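    For readers unfamiliar with the central quantity, here is a minimal sketch of how surprisal is computed: the surprisal of a word is its negative log probability given the preceding context. The sketch below estimates it with a toy add-one-smoothed bigram model in Python; the corpus, smoothing scheme, and function are illustrative stand-ins, not the thesis's actual nine models.

    import math
    from collections import Counter

    # Toy corpus standing in for real training data (hypothetical).
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "a cat saw the dog".split(),
    ]

    # Count unigrams and bigrams, with a sentence-start marker.
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)

    def surprisal(prev: str, word: str) -> float:
        """Surprisal in bits: -log2 P(word | prev), add-one smoothed."""
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        return -math.log2(prob)

    # Per-word surprisal for a test sentence: higher values mark words
    # the model found less predictable given the preceding word.
    test = ["<s>"] + "the cat sat on the rug".split()
    for prev, word in zip(test, test[1:]):
        print(f"{word:>5}: {surprisal(prev, word):.2f} bits")

    Richer models change only how P(word | context) is estimated, for example by adding part-of-speech tags, hierarchical syntactic structure, or longer context windows; the definition of surprisal itself stays the same.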

  • Subjects / Keywords
  • Graduation date
    Fall 2023
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-h37r-s638
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.