

This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Morphological Solutions for Arabic Statistical Machine Translation and Sentiment Analysis Open Access


Subject / Keyword: Statistical Machine Translation; Sentiment Analysis; Natural Language Processing; English Arabic translation
Degree grantor: University of Alberta
Author or creator: Salameh, Mohammad K
Supervisor and department: Grzegorz Kondrak (Computing Science); Colin Cherry (National Research Council Canada)
Examining committee member and department: Abram Hindle (Computing Science); Osmar Zaiane (Computing Science); Nizar Habash (New York University Abu Dhabi, Computer Science)
Department: Department of Computing Science
Degree: Doctor of Philosophy
Abstract:
Morphologically complex languages such as Arabic pose several challenges in Natural Language Processing (NLP) due to their complexity and token sparsity. Most techniques approach the problem by transforming the words of the language from their sparse surface form to a less sparse representation before processing. The transformation usually takes the form of a morphological analysis or a morphological segmentation. This dissertation addresses two tasks in Arabic NLP: Statistical Machine Translation (SMT) and Sentiment Analysis.

To improve English-Arabic SMT, we apply segmentation to Arabic to decrease token sparsity and enhance the correspondence between tokens of the English and Arabic languages. However, due to this segmentation, the translation system is limited to extracting features based on morphemes (partial words) and to outputting only morphemes during decoding. Such a system lacks knowledge of the original form of the words. We further improve translation from English to Arabic by combining both segmented and desegmented views of the target language. The system thus benefits from the sparsity reduction of segmentation while verifying that it generates correct words. We present a language-independent technique for desegmentation that approaches the problem as a string transduction task. We propose a new algorithm that desegments the decoder's search space encoded as a lattice, thus allowing the system to use features from the desegmented view of the search space. We extend the phrase-based statistical machine translation system to allow desegmentation on the fly during decoding. In addition, we conduct an experimental study to determine what matters most in morphologically segmented SMT models.

Our second task is sentiment analysis, where we use Arabic lemmatization to improve sentiment analysis of Arabic tweets and blog posts. We explore translation in the opposite direction, from Arabic into English, in order to evaluate the loss of sentiment predictability when Arabic social media posts are translated into English, either manually or using an SMT system. We use state-of-the-art Arabic and English sentiment analysis systems and develop automatically generated Arabic lexicons from lemmatized tweets to improve this task.
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Mohammad Salameh, Colin Cherry, Grzegorz Kondrak. Integrating Morphological Desegmentation into Phrase-based Decoding. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), San Diego, California, 2016.
Mohammad Salameh, Colin Cherry, Grzegorz Kondrak. What Matters Most in Morphologically Segmented SMT Models? In Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-9), Denver, Colorado, 2015.
Mohammad Salameh, Saif M Mohammad, Svetlana Kiritchenko. Sentiment After Translation: A Case-Study on Arabic Social Media Posts. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Denver, Colorado, June 2015.
Mohammad Salameh, Colin Cherry, Grzegorz Kondrak. Lattice Desegmentation for Statistical Machine Translation. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, June 2014.
Mohammad Salameh, Colin Cherry, Grzegorz Kondrak. Reversing Morphological Tokenization in English-to-Arabic SMT. NAACL HLT 2013 Student Research Workshop, Atlanta, GA, June 2013.

File Details

File format: pdf (PDF/A)
Mime type: application/pdf
File size: 27242316
Last modified: 2016:06:16 16:46:27-06:00
Filename: Salameh_Mohammad_ K_201603_PhD.pdf
Original checksum: 04599366ddf32615dd5964604b546bc3