Combining Variational Sampling and Metropolis–Hastings Sampling for Paraphrase Generation

  • Author / Creator
    Hejazizo, Ali
  • Paraphrasing rewords a sentence so that it conveys the same information as the original, and it can occur at the word level, phrase level, or sentence level. The paraphrasing task has attracted attention in recent years because several natural language processing (NLP) applications, such as question answering, information extraction, information retrieval, and summarization, benefit from automatic paraphrase generation. Researchers have developed various paraphrase generation techniques, including knowledge-based approaches, supervised data-driven approaches, and unsupervised data-driven approaches. Knowledge-based approaches are labor intensive and do not generalize well, while supervised approaches require massive parallel corpora of sentence-paraphrase pairs.

    In this work, we propose an unsupervised paraphrasing technique that operates at both the word level and the phrase level to generate sentence-level paraphrases. Existing work samples either directly from the sentence space or from a variational latent space, whereas our work combines the two. We show the drawbacks and difficulties of techniques that work at the word level only and propose a technique consisting of three word-level operations (word replacement, word deletion, and word insertion) and a novel phrase-level paraphrasing operation (phrase replacement). The three word-level operations sample directly from the sentence space, while our phrase-level operation samples from the latent space of a variational autoencoder (VAE) trained on phrases.

    We perform paraphrase generation iteratively, with the objective of generating paraphrases that are 1) fluent and 2) semantically close to the input sentence. We use the Metropolis–Hastings (MH) algorithm, a Markov chain Monte Carlo (MCMC) method, to sample from the sentence space and the latent space. In each iteration, we randomly select a word or phrase and an operation to form a proposal, and then use MH to accept or reject the proposal and generate a paraphrase.
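
    The iterative propose-then-accept-or-reject loop can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the thesis: `toy_target` is a hypothetical stand-in for a fluency and semantic-closeness score, the vocabulary is a toy, and proposals are drawn uniformly. Only the three word-level operations are shown; the phrase replacement operation is omitted because it needs a trained VAE.

```python
import math
import random

OPS = ("replace", "insert", "delete")  # word-level operations, chosen uniformly


def toy_target(sent, source):
    """Hypothetical unnormalized target; a stand-in for fluency x semantic closeness."""
    if not sent:
        return 0.0  # the empty sentence is never accepted
    overlap = len(set(sent) & set(source)) / len(set(source))  # crude "semantic" term
    repeats = sum(a == b for a, b in zip(sent, sent[1:]))      # crude "disfluency" term
    return math.exp(4.0 * overlap - 2.0 * repeats - 0.5 * abs(len(sent) - len(source)))


def propose(sent, vocab, rng):
    """Pick an operation and a position at random and apply it (sent is non-empty)."""
    op = rng.choice(OPS)
    if op == "insert":
        i = rng.randrange(len(sent) + 1)
        return sent[:i] + (rng.choice(vocab),) + sent[i:]
    i = rng.randrange(len(sent))
    if op == "replace":
        return sent[:i] + (rng.choice(vocab),) + sent[i + 1:]
    return sent[:i] + sent[i + 1:]  # delete


def proposal_prob(src, dst, vocab):
    """Exact probability that propose(src) yields dst, summed over all ways to reach it."""
    n, v = len(src), len(vocab)
    if len(dst) == n and n > 0:  # only "replace" keeps the length
        diffs = [i for i in range(n) if src[i] != dst[i]]
        if not diffs:  # a word was "replaced" by itself
            return sum(w in vocab for w in src) / (3 * n * v)
        if len(diffs) == 1:
            return (dst[diffs[0]] in vocab) / (3 * n * v)
        return 0.0
    if len(dst) == n + 1:  # only "insert" adds one word
        ways = sum(dst[i] in vocab and src[:i] + (dst[i],) + src[i:] == dst
                   for i in range(n + 1))
        return ways / (3 * (n + 1) * v)
    if len(dst) == n - 1:  # only "delete" removes one word
        ways = sum(src[:i] + src[i + 1:] == dst for i in range(n))
        return ways / (3 * n)
    return 0.0


def mh_step(current, source, vocab, rng):
    """One Metropolis-Hastings iteration: propose, then accept or reject."""
    candidate = propose(current, vocab, rng)
    pi_new = toy_target(candidate, source)
    if pi_new == 0.0:
        return current  # zero target probability: always reject
    # Acceptance ratio: target ratio times the backward/forward proposal ratio.
    ratio = (pi_new * proposal_prob(candidate, current, vocab)) / (
        toy_target(current, source) * proposal_prob(current, candidate, vocab))
    return candidate if rng.random() < min(1.0, ratio) else current


def run_chain(source, vocab, steps=200, seed=0):
    """Run the sampler from the source sentence and return the final state."""
    rng = random.Random(seed)
    sent = tuple(source)
    for _ in range(steps):
        sent = mh_step(sent, source, vocab, rng)
    return sent


source = ("how", "can", "i", "learn", "python", "quickly")
vocab = tuple(sorted(set(source) | {"fast", "study", "do"}))
print(" ".join(run_chain(source, vocab)))
```

    Because insertion and deletion are not symmetric moves, the sketch computes the forward and backward proposal probabilities explicitly instead of assuming they cancel. A realistic system would replace `toy_target` with a language model and a semantic similarity measure.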

    We show the effectiveness of our approach with a series of experiments. First, we train a VAE on the Stanford Natural Language Inference (SNLI) dataset and the Quora dataset for the phrase replacement operation. Second, we evaluate our approach on the Quora dataset, which includes 139k pairs of questions and paraphrases, using the iBLEU score as our main evaluation metric. The results show that our novel phrase replacement operation improves paraphrase quality compared with techniques that paraphrase by direct word-level sampling only. We show that our phrase-level operation can effectively edit multiple words at a time and generate high-quality paraphrases. We also discuss the difficulties of evaluation with the iBLEU score and of VAE training.
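
    For reference, iBLEU is commonly defined as a weighted difference of two BLEU scores: it rewards overlap with a reference paraphrase and penalizes copying from the input. The symbols below are notation chosen for this note (s for the source sentence, r for the reference paraphrase, c for the generated candidate), and the weight α is not stated in this abstract.

```latex
\[
\mathrm{iBLEU}(s, r, c) \;=\; \alpha \cdot \mathrm{BLEU}(c, r) \;-\; (1 - \alpha) \cdot \mathrm{BLEU}(c, s),
\qquad 0 \le \alpha \le 1
\]
```

    Under this definition, a candidate that simply repeats the input scores high on the second term and is penalized accordingly.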

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.