Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning

Pelayo Ramirez, Lourdes

doi:doi:10.7939/R3H05D

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

413 views
513 downloads

Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning

Author / Creator

Pelayo Ramirez, Lourdes
The imbalanced learning problem occurs in a large number of economic and health domains of great importance; consequently, it has drawn a significant amount of interest from academia, industry, and government funding agencies. Several researchers have used stratification to alleviate this problem; however, it is not clear what stratification strategy is in general more effective: under-sampling, over-sampling or the combination of both. Our first topic evaluates the contribution of stratification strategies in the software defect prediction area. We study the statistical contribution of stratification in the new Mozilla dataset, a new large-scale software defect prediction dataset which includes both object-oriented metrics and a count of defects per module. Our second topic responds to the debate about the contribution of over-sampling, under-sampling and the combination of both with the employment of a full-factorial design experiment using the Analysis of Variance (ANOVA) over six software defect prediction datasets. We extend our research to develop a stratification method to mitigate sample selection bias in function approximation problems. The sample selection bias is present when the training and test instances are drawn from a different distribution, with the imbalance dataset problem considered a particular case of sample selection bias. We extend the well-known SMOTE over-sampling technique to continuous-valued response variables. Our new algorithm proves to be a valuable algorithm helping to increase the performance on function approximation problems and effectively reducing the impact of sample selection bias.
Subjects / Keywords
Graduation date

Fall 2011
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3H05D
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Electrical and Computer Engineering
Supervisor / co-supervisor and their department(s)
- Dick, Scott (Electrical and Computer Engineering)
Examining committee members and their departments
- Pedrycz, Witold (Electrical and Computer Engineering)
- Sutton, Richard (Computer Science)
- Denzinger, Joerg (Computer Science University of Calgary)
- Reformat, Marek (Electrical and Computer Engineering)