Exploring Surprisal from Various Language Models for Predicting English Reading Times of People with Different Language Backgrounds

  • Author / Creator
    Clark, Shannon R
  • Abstract
    Surprisal estimated by language models is predictive of reading time in first-language (L1) reading. Research is emerging to determine whether this observation extends to reading in a second language (L2). Current attempts to characterize differences in the predictive power of surprisal for L1 and L2 reading times lack exploration of the reader's language background. As such, this thesis aims to evaluate the performance of surprisals derived from various language models for predicting English reading times of people with different L1s. To this end, we trained nine language models that varied in the extent of syntactic information, lexical information, and preceding context they used to compute surprisal. Next, we developed generalized additive mixed models to predict the English reading times of L1 speakers of English, Chinese, Korean, and Spanish using surprisal. Our results showed several commonalities. First, the best-performing surprisal for all language backgrounds was derived from a standard n-gram model or an n-gram model with added part-of-speech tags. Second, the lexical portion of total surprisal from a probabilistic context-free grammar performed more poorly than the syntactic portion. Last, out of the surprisals estimated using only syntactic information, those that accounted for the hierarchical structure of sentences outperformed the one based purely on sequential representations. Apart from these similarities, we observed differences by language background. It appears that surprisal computed using richer context performed better for L1 speakers of left-branching languages. It also seems that surprisals derived using hierarchical syntactic information performed better for languages with a word order different from English. Further research is needed to fully characterize these differences in performance in terms of the linguistic features of the reader's L1 and the way each language model computes surprisal. Our work shows that a variety of language models produce surprisals predictive of L1 and L2 reading times in English. Since the performance of these surprisals varied by the reader's L1, our work suggests that it is important to consider language background when using language models in the study of L2 reading.
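    For readers unfamiliar with the central quantity, here is a minimal sketch of how surprisal is computed: the surprisal of a word is its negative log probability given the preceding context. The sketch below estimates it with a toy add-one-smoothed bigram model in Python; the corpus, smoothing scheme, and function are illustrative stand-ins, not the thesis's actual nine models.

    import math
    from collections import Counter

    # Toy corpus standing in for real training data (hypothetical).
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "a cat saw the dog".split(),
    ]

    # Count unigrams and bigrams, with a sentence-start marker.
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)

    def surprisal(prev: str, word: str) -> float:
        """Surprisal in bits: -log2 P(word | prev), add-one smoothed."""
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        return -math.log2(prob)

    # Per-word surprisal for a test sentence: higher values mark words
    # the model found less predictable given the preceding word.
    test = ["<s>"] + "the cat sat on the rug".split()
    for prev, word in zip(test, test[1:]):
        print(f"{word:>5}: {surprisal(prev, word):.2f} bits")

    Richer models change only how P(word | context) is estimated, for example by adding part-of-speech tags, hierarchical syntactic structure, or longer context windows; the definition of surprisal itself stays the same.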

  • Subjects / Keywords
  • Graduation date
    Fall 2023
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-h37r-s638
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.