
Unsupervised Syntax-based Probabilistic Sentence Generation

  • Author / Creator
    Sayehban, Shahrzad
  • Sentence reconstruction and generation are essential applications in Natural Language Processing (NLP). Early studies were based on classical methods such as production rules and statistical models; more recently, the prevailing models typically use deep neural networks. In this study, we use deep neural networks to develop a model capable of generating new, unseen sentences or reconstructing a given input with minor changes. To achieve this goal, we develop an unsupervised tree-based model built on the Variational Autoencoder (VAE) framework.

    Our approach uses the grammar rules of natural language and generates sentences phrase by phrase, which helps the generated sentences remain semantically and syntactically correct. Previous models typically processed tokens sequentially, so syntax was learned only implicitly. By contrast, our model learns both the sequence of tokens and the syntax of the sentences explicitly in order to generate better samples. The variational modelling enables us to sample from the continuous latent space to generate new sentences or reconstruct the input (a simplified VAE sketch follows this record).

    We demonstrate the effectiveness of this model through experiments. The tree-based VAE model is trained on the Stanford Natural Language Inference (SNLI) dataset. First, we compute the BLEU score of the reconstructions to evaluate the model's reconstruction capability and how well it preserves information from the input (an illustrative BLEU example also follows this record). This score shows that our proposed model reconstructs the input sentences better than the baseline. Second, random sampling from the latent space is used to evaluate the fluency of generation. We report perplexity, UniKL, and entropy to evaluate the quality of the generated sentences. The results show that the randomly sampled sentences are less semantically meaningful; however, they are correct in terms of syntax and the order of phrases, because the grammar rules are applied so that only valid parse trees are generated.

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-9xqd-4q59
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
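
A simplified VAE sketch (illustrative only). The thesis's model is a tree-based VAE that decodes phrases via grammar rules; that structure is not reproduced here. The PyTorch sketch below only shows the generic sentence-VAE mechanics the abstract relies on: encoding a token sequence into a Gaussian latent variable, sampling with the reparameterization trick, and decoding back to tokens. All module names and hyperparameters are assumptions.

import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.z_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)                        # (batch, seq, embed_dim)
        _, h = self.encoder(emb)                        # h: (1, batch, hidden_dim)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        h0 = torch.tanh(self.z_to_hidden(z)).unsqueeze(0)
        dec_out, _ = self.decoder(emb, h0)              # teacher forcing on the input tokens
        logits = self.out(dec_out)                      # (batch, seq, vocab_size)
        # KL(q(z|x) || N(0, I)): the regularizer that keeps the latent space continuous
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return logits, kl

Training would minimize token-level cross-entropy on the logits plus the KL term; new sentences are then produced by sampling z from the standard normal prior and decoding, which is the "random sampling from the latent space" mentioned in the abstract.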
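
An illustrative reconstruction-scoring example. The abstract reports BLEU between inputs and their reconstructions; the exact BLEU configuration used in the thesis is not stated, so the NLTK call, the smoothing choice, and the example sentences below are assumptions.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical SNLI-style input sentence and its reconstruction (tokenized)
reference = ["a", "man", "is", "playing", "a", "guitar", "in", "the", "park"]
reconstruction = ["a", "man", "is", "playing", "guitar", "in", "a", "park"]

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], reconstruction, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")

A higher average of such scores over the test set indicates that more information from the input survives the encode-sample-decode round trip.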