Using Automated Procedures to Score Written Essays in Persian: An Application of the Multilingual BERT System

  • Author / Creator
    Firoozi, Tahereh
  • The automated scoring of student essays is now recognized as a significant development in both the research and practice of educational testing. The majority of published studies on automated essay scoring (AES) focus on outcomes in English. Studies on multilingual AES, meaning AES for languages other than English, are by comparison practically non-existent. The purpose of this study is to develop, describe, and evaluate the first AES system for scoring essays in the Persian language using multilingual BERT. Multilingual BERT is a transformer-based encoder model for language representation that uses an attention mechanism to learn the contextual relations between words and sentences in a text. The Persian language version of BERT was used to grade 2,000 holistically scored essays written by non-native language learners in Iran on a scale that ranged from 1 (Elementary) to 5 (Advanced). The performance of the BERT AES model was examined against a baseline model that only included a word embedding layer (Word2Vec). The models were evaluated using four metrics: the Quadratic Weighted Kappa (QWK), the Kappa coefficient (κ), model accuracy, and error analysis. The BERT AES model performed with high classification consistency (QWK = 0.84 vs. baseline QWK = 0.75; κ = 0.93 vs. baseline κ = 0.82). The accuracy measures show that the BERT AES model correctly scored about 73% of the total number of essays. Of those essays considered correctly classified by the BERT AES system, more than 70% in each level except for Advanced were scored the same by the human raters (i.e., true positive). Among the essays that were incorrectly classified, more than 70% in each score level, except for Advanced, were considered incorrect (i.e., true negative). Error analysis showed that each level had some overlap with the adjacent levels, with the Upper-Intermediate and the Advanced levels having the highest number of overlaps.
These results demonstrate that the BERT AES model can be used with a high degree of accuracy to predict the essay scores produced by the raters in this study. The one area where the performance results were comparatively weak was the Advanced level, due to the smaller number of essays (n = 238). Augmentation provides a method that can be used to solve the text data sparsity problem when using low-resource languages like Persian. To improve model performance, sentence-level data augmentation (SLDA) was implemented by adding 20% more data to each score level. This approach improved the classification performance of the BERT AES model (QWK pre-SLDA = 0.84 vs. QWK post-SLDA = 0.96; κ pre-SLDA = 0.88 vs. κ post-SLDA = 0.96), thereby demonstrating the benefits of text augmentation. The architecture and methods described in this study can be easily adapted and used to score essays written in other non-English languages, thereby supporting the application and widespread use of multilingual AES.
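
For readers unfamiliar with the Quadratic Weighted Kappa reported above, the following is a minimal sketch of how the statistic is commonly computed for two sets of ordinal scores on the 1 to 5 scale. It is an illustration of the standard formula, not the code used in the thesis; the function name `quadratic_weighted_kappa` and its parameters are hypothetical.

```python
def quadratic_weighted_kappa(human, model, min_score=1, max_score=5):
    """Standard QWK between two lists of integer scores (illustrative sketch)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    k = max_score - min_score + 1   # number of score levels
    n = len(human)                  # number of essays

    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal counts for each rater.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            numerator += weight * observed[i][j]
            denominator += weight * hist_human[i] * hist_model[j] / n

    # Assumption: if every score from both raters falls in one level,
    # chance disagreement is zero; this sketch treats that as perfect agreement.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator
```

Because the penalty grows with the square of the distance between levels, confusing Elementary with Advanced costs far more than confusing two adjacent levels, which is why QWK suits ordinal scales such as the one described here.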

  • Subjects / Keywords
  • Graduation date
    Spring 2023
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.