Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning

  • Author / Creator
    Pelayo Ramirez, Lourdes
  • The imbalanced learning problem occurs in a large number of economic and health domains of great importance; consequently, it has drawn a significant amount of interest from academia, industry, and government funding agencies. Several researchers have used stratification to alleviate this problem; however, it is not clear what stratification strategy is in general more effective: under-sampling, over-sampling or the combination of both. Our first topic evaluates the contribution of stratification strategies in the software defect prediction area. We study the statistical contribution of stratification in the new Mozilla dataset, a new large-scale software defect prediction dataset which includes both object-oriented metrics and a count of defects per module. Our second topic responds to the debate about the contribution of over-sampling, under-sampling and the combination of both with the employment of a full-factorial design experiment using the Analysis of Variance (ANOVA) over six software defect prediction datasets. We extend our research to develop a stratification method to mitigate sample selection bias in function approximation problems. The sample selection bias is present when the training and test instances are drawn from a different distribution, with the imbalance dataset problem considered a particular case of sample selection bias. We extend the well-known SMOTE over-sampling technique to continuous-valued response variables. Our new algorithm proves to be a valuable algorithm helping to increase the performance on function approximation problems and effectively reducing the impact of sample selection bias.

  • Subjects / Keywords
  • Graduation date
    Fall 2011
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.