Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays

Vaisipour, Saman

doi:doi:10.7939/R3028PN49

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

409 views
413 downloads

Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays

Author / Creator

Vaisipour, Saman
Gene expression microarrays are widely used to better understand the complex biological mechanisms inside cells. One of the main obstacles of applying statistical learning algorithms to microarray data is the large gap between the number of features (p) and the number of available instances (n), i.e., the “large p, small n” challenge. This thesis explores two ways to deal with this challenge.
One approach is to increase n by combining similarly appropriate microarray data sets together. This is appealing as there are now many publicly available microarray studies. The main problem of this approach is the batch effect, i.e., the influence of non-biological factors on expression intensities that can confound the biological signal. As a result, combining gene expression studies without correcting for batch effects may lead to misleading findings.
This thesis proposes a novel batch correction algorithm, called batch effect correction using canonical correlation analysis (BECCA), that assumes the batch effect is due to additive independent confounding factors and so utilizes canonical correlation analysis to separate technical bias from the measured biological signal. We compare BECCA to various existing batch correction algorithms using several real-world gene expression studies and find that BECCA has similar performance. The key advantage of utilizing BECCA, compared to other similar performing algorithms, is its flexibility, as BECCA allows the user to adjust how much common signal to preserve across the batches and how much batch related signal to remove from each one by changing the values of BECCA parameters.
The second approach to batch correction considers the wisdom of reducing p by selecting a subset of genes. Our experiments suggest that some genes in microarray data sets contain very little biological signal, i.e., including only these genes in the calculations makes all specimens highly correlated, regardless of their tissue of origin or disease state. It is, therefore, desirable to identify and remove these misleading genes before conducing downstream analysis or batch correction. For this purpose, we propose an efficient algorithm to extend the single-study variance-based gene selection method to a multi-study gene selection algorithm. Our empirical results show this feature selection algorithm outperforms other algorithms in reducing the destructive influence of batch effects.
Subjects / Keywords
Graduation date

Spring 2014
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3028PN49
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Russell Greiner (Computing Science)
Examining committee members and their departments
- Jörg Sander (Computing Science)
- David Wishart (Computing Science)
- Terence Speed (University of California at Berkeley, Department of Statistics)
- Dale Schuurmans (Computing Science)