Identifying Cognate Sets Across Dictionaries of Related Languages

St Arnaud, Adam, J.J.

doi:doi:10.7939/R3NV99Q98

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

669 views
1047 downloads

Identifying Cognate Sets Across Dictionaries of Related Languages

Author / Creator

St Arnaud, Adam, J.J.
Cognates are words in related languages that have originated from the same word in an ancestor language, such as the English/German word pair father/Vater. Cognate information is critical in the field of historical linguistics, where it is used to determine the relationships between languages and to construct the ancestor languages they originated from. Most recent work in cognate identification focuses on the task of clustering cognates within lists of words each having an identical definition. In that task, only orthographic or phonetic information about a word is utilized when making cognate judgments. We present a system for the more challenging task of identifying cognate sets across dictionaries of related languages. The likelihood of a cognate relationship is calculated on the basis of a rich set of features that capture both phonetic and semantic similarity, as well as the presence of regular sound correspondences. The pairwise similarity scores are combined with an average-score clustering algorithm to create sets of words from different languages that may originate from a common proto-word. When tested on the Algonquian language family, our system detects 63% of cognate sets while maintaining cluster purity of 70%.
Subjects / Keywords
Graduation date

Fall 2017
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/R3NV99Q98
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Kondrak, Grzegorz (Computing Science)
Examining committee members and their departments
- Beck, David (Linguistics)
- Amaral, J. Nelson (Computing Science)