Morphological Solutions for Arabic Statistical Machine Translation and Sentiment Analysis

  • Author / Creator
    Salameh, Mohammad K
  • Morphologically complex languages such as Arabic pose several challenges in Natural Language Processing (NLP) due to their complexity and token sparsity. Most techniques approach the problem by transforming the words of the language from their sparse surface forms to a less sparse representation before processing. The transformation usually takes the form of morphological analysis or morphological segmentation. This dissertation addresses two tasks in Arabic NLP: Statistical Machine Translation (SMT) and Sentiment Analysis. To improve English-to-Arabic SMT, we apply segmentation to Arabic to decrease token sparsity and improve the correspondence between English and Arabic tokens. However, because of this segmentation, the translation system is limited to extracting features based on morphemes (partial words) and outputs only morphemes during decoding. Such a system lacks knowledge of the original form of the words. We further improve translation from English to Arabic by combining the segmented and desegmented views of the target language, so that the system benefits from the sparsity reduction of segmentation while verifying that it generates correct words. We present a language-independent desegmentation technique that approaches the problem as a string transduction task. We propose a new algorithm that desegments the decoder's search space, encoded as a lattice, thus allowing the system to use features from the desegmented view of the search space. We also extend the phrase-based statistical machine translation system to perform desegmentation on the fly during decoding. In addition, we conduct an experimental study to determine what matters most in morphologically segmented SMT models. Our second task is sentiment analysis, where we use Arabic lemmatization to improve sentiment analysis of Arabic tweets and blog posts.
We explore translation in the opposite direction, from Arabic into English, in order to evaluate the loss of sentiment predictability when Arabic social media posts are translated into English, either manually or by an SMT system. We use state-of-the-art Arabic and English sentiment analysis systems and develop automatically generated Arabic lexicons from lemmatized tweets to improve this task.

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Computing Science
  • Supervisor / co-supervisor and their department(s)
    • Grzegorz Kondrak (Computing Science)
    • Colin Cherry (National Research Council Canada)
  • Examining committee members and their departments
    • Osmar Zaiane (Computing Science)
    • Nizar Habash (New York University Abu Dhabi, Computer Science)
    • Abram Hindle (Computing Science)