




This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays Open Access


Other title
Gene Expression
Batch Effect
Type of item
Degree grantor: University of Alberta
Author or creator: Vaisipour, Saman
Supervisor and department: Russell Greiner (Computing Science)
Examining committee member and department:
Jörg Sander (Computing Science)
Terence Speed (University of California at Berkeley, Department of Statistics)
Dale Schuurmans (Computing Science)
David Wishart (Computing Science)
Department of Computing Science

Date accepted
Graduation date
Doctor of Philosophy
Degree level
Gene expression microarrays are widely used to better understand the complex biological mechanisms inside cells. One of the main obstacles to applying statistical learning algorithms to microarray data is the large gap between the number of features (p) and the number of available instances (n), i.e., the “large p, small n” challenge. This thesis explores two ways to deal with this challenge.

One approach is to increase n by combining similar, suitable microarray data sets. This is appealing as there are now many publicly available microarray studies. The main problem with this approach is the batch effect, i.e., the influence of non-biological factors on expression intensities, which can confound the biological signal. As a result, combining gene expression studies without correcting for batch effects may lead to misleading findings. This thesis proposes a novel batch correction algorithm, batch effect correction using canonical correlation analysis (BECCA), which assumes that the batch effect is due to additive, independent confounding factors and therefore uses canonical correlation analysis to separate technical bias from the measured biological signal. We compare BECCA to various existing batch correction algorithms on several real-world gene expression studies and find that BECCA has similar performance. The key advantage of BECCA over other similarly performing algorithms is its flexibility: by changing the values of its parameters, the user can adjust how much common signal to preserve across the batches and how much batch-related signal to remove from each one.

The second approach reduces p by selecting a subset of genes. Our experiments suggest that some genes in microarray data sets contain very little biological signal, i.e., using only these genes in the calculations makes all specimens highly correlated, regardless of their tissue of origin or disease state. It is therefore desirable to identify and remove these misleading genes before conducting downstream analysis or batch correction. For this purpose, we propose an efficient algorithm that extends the single-study variance-based gene selection method to multiple studies. Our empirical results show that this feature selection algorithm outperforms other algorithms in reducing the destructive influence of batch effects.
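
To make the idea of variance-based gene selection across several studies concrete, here is a minimal sketch in Python. It is not the algorithm proposed in the thesis, whose details are not given in this record. It assumes one simple way of combining studies: rank genes by variance within each study, average the ranks, and keep the highest-ranked genes. Ranking within each study, rather than pooling the samples, keeps a constant per-study offset from inflating a gene's variance. The function name `select_genes_multi_study`, the `keep` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np


def select_genes_multi_study(studies, keep):
    """Return sorted indices of the `keep` genes with the highest mean variance rank.

    studies: list of arrays, each of shape (n_samples_i, n_genes), with the
             same gene order in every study (an assumption of this sketch).
    """
    n_genes = studies[0].shape[1]
    rank_sum = np.zeros(n_genes)
    for X in studies:
        # Per-gene sample variance, computed inside this study only.
        var = X.var(axis=0, ddof=1)
        # Double argsort turns values into ranks: 0 = lowest variance.
        rank_sum += var.argsort().argsort()
    mean_rank = rank_sum / len(studies)
    # Take the genes with the largest mean rank.
    top = np.argsort(mean_rank)[::-1][:keep]
    return np.sort(top)


# Synthetic illustration: 50 genes, of which only the first 5 vary strongly.
rng = np.random.default_rng(0)
study_a = rng.normal(0.0, 0.1, size=(20, 50))
study_a[:, :5] += rng.normal(0.0, 3.0, size=(20, 5))
study_b = rng.normal(2.0, 0.1, size=(15, 50))  # constant offset mimics a batch shift
study_b[:, :5] += rng.normal(0.0, 3.0, size=(15, 5))

selected = select_genes_multi_study([study_a, study_b], keep=5)
print(selected)
```

In this toy setting the five high-variance genes are recovered despite the offset between the two studies. Real microarray batch effects are generally more complex than a constant shift.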
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 10496131
Last modified: 2015:10:12 16:15:01-06:00
Filename: Vaisipour_Saman_Spring2014.pdf
Original checksum: 95059adcb3b161394852e9ccad6432c5
Well formed: false
Valid: false
Status message: Lexical error offset=10435249
File title: Detecting, correcting, and preventing the batch effects in multi-site data, with a focus on gene expression Microarrays
File author: Saman Vaisipour
Page count: 175