Search

Skip to Search Results
  • Spring 2024

    Shahriar, Arif

    Bengali and Hind are two widely spoken yet low-resource languages. The state-of-the-art in modeling such languages uses BERT and the Wordpiece tokenizer. We observed that the Wordpiece tokenizer often breaks words into meaningless tokens, failing to separate roots from affixes. Moreover,...

1 - 1 of 1