Unsupervised Mining and Summarization of Polarized Contentious Issues from Online Text

  • Author / Creator
    Trabelsi, Amine
  • Abstract
    This thesis contributes to ongoing research on opinion mining. It develops new models for discovering viewpoints, and the reasons supporting them, in polarized contentious texts drawn from survey responses, debate websites, and editorials.

    This research proposes a purely unsupervised approach that needs neither large annotated datasets nor any external guidance. It works only with raw documents consisting of real, unstructured social media text. We first propose a novel Joint Topic Viewpoint (JTV) Bayesian probabilistic model and a modified clustering algorithm that automatically generate distinctive and informative patterns of associated terms, each denoting the vocabulary of a specific reason. Terms are clustered according to the hidden topics they discuss and the embedded viewpoint they voice. The resulting reason lexicons are shown to be highly coherent. JTV's clustering performance exceeds that of state-of-the-art and baseline methods, and this advantage holds across six datasets covering three different types of contentious documents.

    Moreover, we formulate a purely unsupervised Author Interaction Topic Viewpoint (AITV) model at the post and discourse levels. Like JTV, AITV uses the content of the posts, but it also integrates reply information about the authors' interactions. The model assumes heterophily when encoding the nature of these interactions, that is, differences in viewpoint breed interaction. We evaluate the model's viewpoint identification and clustering accuracy at the author and post levels. Experiments are run on six corpora covering four controversial issues, extracted from two online debate forums. At the post level, AITV identifies viewpoints more accurately than state-of-the-art supervised stance prediction methods. It also outperforms a recently proposed topic model for viewpoint discovery in social networks, and achieves results close to those of a weakly guided unsupervised method on author-level viewpoint identification. Our results highlight the importance of encoding heterophily for purely unsupervised viewpoint identification in online debates.

    Finally, we design a generic pipeline framework that produces a contrastive textual summary of the main viewpoints of each opposing side, in the form of a fine-grained digest table. The digest table automatically extracts and displays the major distinct reasons put forward in the text, organized by topic (facet of argumentation) and by divergent viewpoint. The modular pipeline comprises phrase mining, Topic Viewpoint, and reason extraction modules. As a pipeline component, we propose a Phrase Author Interaction Topic Viewpoint (PhAITV) model, which extends AITV by jointly processing phrases of different lengths, instead of just unigrams, while leveraging the interactions of authors in online debates. The final table is evaluated extensively on text about issues extracted from different forums. The evaluation is based on three measures: the informativeness of the digest table as a summary, the relevance of the extracted sentences as reasons, and the accuracy of their viewpoint clustering. Across issues, our pipeline improves significantly over two state-of-the-art methods and several baselines in document summarization, reason retrieval, and viewpoint clustering.

  • Subjects / Keywords
  • Graduation date
    Fall 2018
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/R36T0HC1M
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.