ERA

Permanent link (DOI): https://doi.org/10.7939/R3HT2GR63

Communities

This file is in the following communities:

Graduate Studies and Research, Faculty of

Collections

This file is in the following collections:

Theses and Dissertations

Modeling Inflectional Complexity in Natural Language Processing (Open Access)

Descriptions

Subject/Keyword
Morphology
Computational Linguistics
Natural Language Processing
Inflection
String Transduction
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Nicolai, Garrett J
Supervisor and department
Grzegorz Kondrak, Computing Science
Examining committee member and department
Davood Rafiei, Computing Science
Mans Hulden, Linguistics, University of Colorado, Boulder
Colin Cherry, Computing Science
David Beck, Linguistics
Department
Department of Computing Science
Date accepted
2017-09-05T10:00:36Z
Graduation date
2017-11:Fall 2017
Degree
Doctor of Philosophy
Degree level
Doctoral
Abstract
Inflectional morphology presents numerous problems for traditional computational models, not least of which is an increase in the number of rare types in any corpus. Although few annotated corpora exist for morphologically complex languages, lay speakers of a language can generate data, such as inflection tables, that describe patterns learnable by machine learning algorithms. We investigate four inflectional tasks (inflection generation, stemming, lemmatization, and morphological analysis) and demonstrate that each can be accurately modeled using sequential string transduction methods. Furthermore, expert annotation is unnecessary: inflectional models are learned from crowd-sourced inflection tables.
We first investigate inflection generation: given a dictionary form and a tag representing inflectional information, we produce inflected word-forms. We then refine our predictions by referring to the other forms within a paradigm. Results of experiments on six diverse languages with varying amounts of training data demonstrate that our approach improves the state of the art in terms of predicting inflected word-forms.
We next investigate stemming: the removal of inflectional prefixes and suffixes from a word. Unlike the inflection generation task, it is not possible to use inflection tables to learn a fully-supervised stemming model; however, we exploit paradigmatic regularity to identify stems in an unsupervised manner with over 85% accuracy. Experiments on English, Dutch, and German show that our stemmers substantially outperform rule-based and unsupervised stemmers such as Snowball and Morfessor, and approach the accuracy of a fully-supervised system. Furthermore, the generated stems are more consistent than those annotated by experts.
We also use the inflection tables to learn models that generate lemmas from inflected forms. Unlike stemming, lemmatization restores orthographic changes that have occurred during inflection. These models are more accurate than Morfette and Lemming on most datasets.
Finally, we extend our lemmatization methods to produce complete morphological analyses: given a word, return a set of lemma / tag pairs that may have generated it. This task is more ambiguous than inflection generation or lemmatization, which typically produce only a small number of outputs. Thus, morphological analysis involves producing a complete list of lemma+tag analyses for a given word-form. Experiments on four languages demonstrate that our system has much higher coverage than a hand-engineered FST analyzer, and is more accurate than a state-of-the-art morphological tagger.
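To make the task definitions in the abstract concrete, the following minimal sketch shows how a single inflection table relates the tasks to one another. It is a lookup-based toy and not the string transduction models the thesis trains: the table contents, the tag names, and the functions `generate`, `stem`, and `analyze` are all hypothetical, and the longest-common-prefix stem is only a crude stand-in for "paradigmatic regularity".

```python
from collections import defaultdict
from os.path import commonprefix

# Hypothetical toy inflection tables: lemma -> {tag: inflected form}.
# The tag names are illustrative only, not a real annotation scheme.
TABLES = {
    "walk": {
        "PRS.3SG": "walks",
        "PST": "walked",
        "PTCP.PRS": "walking",
        "PTCP.PST": "walked",
    },
    "hope": {
        "PRS.3SG": "hopes",
        "PST": "hoped",
        "PTCP.PRS": "hoping",
        "PTCP.PST": "hoped",
    },
}


def generate(lemma, tag):
    """Inflection generation: (dictionary form, tag) -> inflected form."""
    return TABLES[lemma][tag]


def stem(lemma):
    """Crude stem: the longest prefix shared by the lemma and all its forms.

    Assumption: a purely suffixing paradigm; this is not the thesis's method.
    """
    forms = [lemma] + list(TABLES[lemma].values())
    return commonprefix(forms)


# Invert the tables once: inflected form -> set of (lemma, tag) pairs.
ANALYSES = defaultdict(set)
for _lemma, _table in TABLES.items():
    for _tag, _form in _table.items():
        ANALYSES[_form].add((_lemma, _tag))


def analyze(form):
    """Morphological analysis: form -> every (lemma, tag) that yields it."""
    return set(ANALYSES.get(form, ()))


if __name__ == "__main__":
    print(generate("walk", "PST"))
    print(stem("hope"))
    print(sorted(analyze("walked")))
```

In this toy data, `stem("hope")` yields `hop` while the lemma remains `hope`, which mirrors the abstract's distinction between stemming and lemmatization. Likewise, `analyze("walked")` returns two lemma / tag pairs, which illustrates why analysis has more ambiguous output than generation. A lookup table cannot handle unseen words, which is the gap a learned model is meant to fill.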
Language
English
DOI
doi:10.7939/R3HT2GR63
Rights
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Nicolai, G., Cherry, C., and Kondrak, G. (2015). Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922-931. Association for Computational Linguistics.
Nicolai, G. and Kondrak, G. (2016). Leveraging inflection tables for stemming and lemmatization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1138-1147. Association for Computational Linguistics.
Nicolai, G. and Kondrak, G. (2017). Morphological analysis without expert annotation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 211-216. Association for Computational Linguistics.
Nicolai, G., Hauer, B., St Arnaud, A., and Kondrak, G. (2016). Morphological reinflection via discriminative string transduction. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 31-35. Association for Computational Linguistics.

File Details

Date Modified
2017-09-05T16:00:37.213+00:00
Audit Status
Audits have not yet been run on this file.
Characterization
File format: pdf (PDF/A)
Mime type: application/pdf
File size: 579420
Last modified: 2017:11:08 16:56:51-07:00
Filename: Nicolai_Garrett_J_201708_PhD.pdf
Original checksum: 803c42fd37b87f886b664f45d45ab543