Improving Bengali and Hindi Large Language Models

  • Author / Creator
    Shahriar, Arif
  • Abstract
    Bengali and Hindi are two widely spoken yet low-resource languages. The state of the art in modeling such languages uses BERT with the WordPiece tokenizer. We observed that the WordPiece tokenizer often breaks words into meaningless tokens, failing to separate roots from affixes; moreover, WordPiece does not take fine-grained character-level information into account. We hypothesize that modeling fine-grained character-level information, or the interactions between roots and affixes, helps with highly inflected and morphologically complex languages such as Bengali and Hindi. We therefore trained BERT with two alternative tokenizers, a Bengali and Hindi Unigram tokenizer and a character-level tokenizer, and observed better performance. We then pre-trained two language models accordingly and evaluated them on masked token detection, in both correct and erroneous settings, across many NLU tasks. We provide experimental evidence that Unigram and character-level tokenizers lead to better pre-trained models for Bengali and Hindi, outperforming the previous state of the art and BERT with a WordPiece vocabulary. This is the first study investigating the efficacy of different tokenization methods for modeling Bengali and Hindi. (An illustrative tokenizer-comparison sketch follows this record.)

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-thma-q304
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
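Below is a minimal, self-contained sketch of the tokenizer contrast described in the abstract. It uses the Hugging Face tokenizers library; the tiny corpus, the vocabulary size, and the example word "করেছিলাম" are illustrative assumptions and do not reproduce the thesis's actual training data or models.

    # Sketch: compare WordPiece vs. Unigram segmentation of an inflected
    # Bengali word, plus a naive character-level split. Illustrative only;
    # real segmentations depend on the training corpus and vocabulary size.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Tiny toy corpus of inflected forms of the Bengali root "কর" ("do").
    corpus = [
        "করা করি করে করেন করছি করছে করেছি করেছে করেছিলাম করেছিলেন",
        "করব করবে করবেন করলে করতে করতেন করিয়ে করানো করালাম",
    ]

    def train(model, trainer):
        tok = Tokenizer(model)
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
        tok.train_from_iterator(corpus, trainer)
        return tok

    wordpiece = train(
        models.WordPiece(unk_token="[UNK]"),
        trainers.WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]),
    )
    unigram = train(
        models.Unigram(),
        trainers.UnigramTrainer(vocab_size=60, special_tokens=["[UNK]"],
                                unk_token="[UNK]"),
    )

    word = "করেছিলাম"  # "(I) had done": root কর + tense/person affixes
    print("WordPiece :", wordpiece.encode(word).tokens)
    print("Unigram   :", unigram.encode(word).tokens)
    print("Char-level:", list(word))  # naive code-point segmentation

The exact splits vary from run to run with the corpus and vocabulary size; the sketch only shows that the two subword models can segment the same inflected word differently, which is the contrast the thesis investigates at scale.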