Developing and Evaluating Algorithms for Fixing Omission and Commission Errors in Structured Data

  • Author / Creator
    Nashaat Ali Elmowafy, Mona
  • Abstract
    The use of machine learning is rising rapidly and delivers benefits across many domains. However, developing predictive systems involves many challenges that can drastically delay model deployment. For instance, obtaining labeled training data is one of the most expensive bottlenecks in data preprocessing for machine learning. Therefore, organizations in many domains apply weak supervision to produce noisy labels. However, since weak supervision relies on cheaper sources, the quality of the generated labels is often problematic. Although recent research tries to enable machine learning to work with different types of weak supervision, such as noisy and incomplete data, the previous literature treats each type individually without considering the possibility of compound weakly supervised learning.
    Similarly, handling data quality issues in big data has turned into a challenging task. The key characteristics of big data have amplified the harmful impact of data errors. For example, the tremendous rate of data collection, along with the variable nature of big data, has complicated the process of error detection since data has become susceptible to various types of errors. Existing error detection techniques are typically tailored to detect certain types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples.

    Motivated by these challenges, this research proposes a set of systems to handle the problems of data preparation in real-world situations. To inform the design of these systems, an extensive experimental study was conducted to evaluate the effectiveness of existing solutions on real-world data. First, to address the data labeling challenges, we propose a novel technique that combines weak supervision and active learning to solve the labeling problem in large industrial datasets. The proposed system optimizes the labeling process to minimize the annotation cost while incorporating domain expertise.
    Second, to tackle the problem of learning in the presence of weak data, we present a classification algorithm that can handle inaccurate and incomplete supervised datasets. The model exploits the unlabeled data in semi-supervised settings to detect noisy data points. Then, it applies a rectification process to improve the performance of the final classifier.
    Finally, to provide a holistic error detection system for tabular data, we present a self-learning bidirectional encoder representation for tabular data. The system follows the encoder architecture with multiple self-attention layers to model the dependencies between data cells and capture tuple-level representations. Once these representations are inferred from the data, the model parameters are fine-tuned on the task of erroneous data detection.
    To evaluate the systems mentioned above, we run an extensive set of experiments against state-of-the-art techniques. During the experiments, we report different evaluation metrics, including classification performance, human effort, and data quality measures. The empirical results are highly promising and indicate that the proposed frameworks can help improve data quality and automate most data preparation processes.
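
The idea of combining weak supervision with active learning can be illustrated with a small sketch. This is not the thesis's system, whose details are not given in this record. It only assumes a common setup: several cheap labeling sources vote +1, -1, or 0 (abstain) on each record, a majority vote gives a noisy label, and a limited expert budget is spent on the records where the sources disagree most. All function names below are hypothetical.

```python
from typing import Callable, List, Sequence

ABSTAIN = 0  # a weak source returns 0 when it has no opinion


def vote_margin(votes: Sequence[int]) -> int:
    """Absolute vote total; 0 means the sources conflict evenly or all abstain."""
    return abs(sum(votes))


def weak_label(votes: Sequence[int]) -> int:
    """Majority vote over weak sources: +1, -1, or ABSTAIN on a tie."""
    total = sum(votes)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return ABSTAIN


def select_for_expert(vote_matrix: Sequence[Sequence[int]], budget: int) -> List[int]:
    """Indices of the `budget` records with the smallest vote margin.

    sorted() is stable, so ties keep their original record order.
    """
    order = sorted(range(len(vote_matrix)), key=lambda i: vote_margin(vote_matrix[i]))
    return order[:budget]


def label_with_budget(
    vote_matrix: Sequence[Sequence[int]],
    oracle: Callable[[int], int],
    budget: int,
) -> List[int]:
    """Start from weak labels, then overwrite the most uncertain ones
    with labels from the expert (the `oracle` callback)."""
    labels = [weak_label(votes) for votes in vote_matrix]
    for i in select_for_expert(vote_matrix, budget):
        labels[i] = oracle(i)
    return labels


if __name__ == "__main__":
    # Four records, three weak sources each (toy data for illustration only).
    votes = [[1, 1, 1], [1, -1, 0], [-1, -1, 1], [0, 0, 0]]
    expert_answers = {1: -1, 3: 1}  # pretend expert labels for records 1 and 3
    print(select_for_expert(votes, budget=2))
    print(label_with_budget(votes, expert_answers.__getitem__, budget=2))
```

In this toy setup the expert is only consulted for the records on which the weak sources give no clear signal, which is one simple way to keep annotation cost low while still using domain expertise.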

  • Subjects / Keywords
  • Graduation date
    Fall 2020
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-ztwj-vq19
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.