Morphological Solutions for Arabic Statistical Machine Translation and Sentiment Analysis

Salameh, Mohammad K

doi:10.7939/R35T3G488

Morphologically complex languages such as Arabic pose several challenges in Natural Language Processing (NLP) due to their complexity and token sparsity. Most techniques approach the problem by transforming the words of the language from their sparse surface form to a less sparse representation before processing. The transformation usually takes the form of a morphological analysis or a morphological segmentation. This dissertation addresses two tasks in Arabic NLP: Statistical Machine Translation (SMT) and Sentiment Analysis. To improve English-Arabic SMT, we apply segmentation to Arabic to decrease token sparsity and enhance the correspondence between English and Arabic tokens. However, due to this segmentation, the translation system is limited to extracting features based on morphemes (partial words) and to outputting only morphemes during decoding. Such a system lacks knowledge of the original form of the words. We further improve translation from English to Arabic by combining both segmented and desegmented views of the target language. The system thus benefits from the sparsity reduction that segmentation provides while verifying that it generates correct words. We present a language-independent technique for desegmentation that approaches the problem as a string transduction task. We propose a new algorithm that desegments the decoder's search space encoded as a lattice, thus allowing the system to use features from the desegmented view of the search space. We extend the phrase-based statistical machine translation system to allow desegmentation on the fly during the decoding process. In addition, we conduct an experimental study to determine what matters most in morphologically segmented SMT models.
Our second task is sentiment analysis, where we use Arabic lemmatization to improve sentiment analysis of Arabic tweets and blog posts. We also explore translation in the opposite direction, from Arabic into English, in order to evaluate the loss of sentiment predictability when Arabic social media posts are translated to English, either manually or using an SMT system. We use state-of-the-art Arabic and English sentiment analysis systems and develop automatically generated Arabic lexicons from lemmatized tweets to improve this task.
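As a rough illustration of the two ideas the abstract relies on (segmentation reduces the number of distinct token types, and desegmentation must restore the original word form), the following minimal Python sketch may help. It is not the dissertation's method: the segmentation table, the transliterated toy words, the `+` affix markers, and the function names `segment`, `build_desegmentation_table`, and `desegment` are all hypothetical and chosen only for this example. It uses a simple lookup table where the thesis describes string transduction and lattice desegmentation.

```python
# Hypothetical toy segmentation table using Latin transliteration.
# Prefixes end with "+", suffixes start with "+". Illustrative entries only.
SEGMENTATIONS = {
    "wktAb": ["w+", "ktAb"],
    "ktAbhA": ["ktAb", "+hA"],
    "wktAbhA": ["w+", "ktAb", "+hA"],
    "llktAb": ["l+", "Al+", "ktAb"],
    "wqlm": ["w+", "qlm"],
    "qlmhA": ["qlm", "+hA"],
    "wqlmhA": ["w+", "qlm", "+hA"],
}


def segment(tokens):
    """Replace each surface word by its morphemes (unknown words stay whole)."""
    morphemes = []
    for token in tokens:
        morphemes.extend(SEGMENTATIONS.get(token, [token]))
    return morphemes


def build_desegmentation_table(segmentations):
    """Invert the segmentation table: morpheme tuple -> surface word."""
    return {tuple(parts): word for word, parts in segmentations.items()}


def desegment(morphemes, table):
    """Group morphemes into words, then map each group to a surface form.

    A new word starts at any morpheme that is not a suffix and does not
    follow a prefix. Groups missing from the table fall back to plain
    concatenation, which can produce a wrong surface form.
    """
    groups, current = [], []
    for morpheme in morphemes:
        is_suffix = morpheme.startswith("+")
        follows_prefix = bool(current) and current[-1].endswith("+")
        if current and not is_suffix and not follows_prefix:
            groups.append(current)
            current = []
        current.append(morpheme)
    if current:
        groups.append(current)
    return [
        table.get(tuple(group), "".join(part.strip("+") for part in group))
        for group in groups
    ]


corpus = ["wktAb", "ktAbhA", "wktAbhA", "ktAb", "llktAb",
          "wqlm", "qlmhA", "wqlmhA", "qlm"]
segmented = segment(corpus)
table = build_desegmentation_table(SEGMENTATIONS)

surface_types = len(set(corpus))       # distinct surface words
morpheme_types = len(set(segmented))   # distinct morphemes after segmentation
restored = desegment(segmented, table)
naive = desegment(["l+", "Al+", "ktAb"], {})  # concatenation-only fallback
```

In this toy corpus the segmented view has fewer distinct types than the surface view, and the table lookup recovers `llktAb` where plain concatenation would yield `lAlktAb`. That gap between concatenation and the correct surface form is the kind of case a learned desegmentation model is meant to handle.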
Graduation date

Spring 2016
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R35T3G488
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Grzegorz Kondrak (Computing Science)
- Colin Cherry (National Research Council Canada)
Examining committee members and their departments
- Abram Hindle (Computing Science)
- Osmar Zaiane (Computing Science)
- Nizar Habash (New York University Abu Dhabi, Computer Science)