




This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

The challenge of applying machine learning techniques to diagnose schizophrenia using multi-site fMRI data
Open Access


Other title
batch effects
machine learning
Type of item
Degree grantor
University of Alberta
Author or creator
Vega Romero, Roberto I
Supervisor and department
Brown, Matthew (Psychiatry)
Greiner, Russell (Computing Science)
Examining committee member and department
Schuurmans, Dale (Computing Science)
Brown, Matthew (Psychiatry)
Greiner, Russell (Computing Science)
Boulanger, Pierre (Computing Science)
Department of Computing Science

Date accepted
Graduation date
2017-06:Spring 2017
Master of Science
Degree level
One of the main challenges in applying machine learning techniques to neuroimaging data is the small n, large p problem: datasets usually contain only a few hundred instances (n), each described by hundreds of thousands of features (p). In this dissertation, we explore the effects of reducing the number of features by analyzing 264 specific regions of interest of the brain, and of increasing the number of instances by merging imaging data obtained from different scanning sites, for the task of distinguishing people with schizophrenia from healthy controls. Empirical results show that, using features related to the functional connectivity of the brain, we can achieve an accuracy of over 70%, which is above chance level, when data from a single scanning site is used for both training and testing. However, this performance decreases when additional data from a different scanning site is used as part of the training process. We attribute the decrease to batch effects: technical noise introduced at different scanning sites that confounds the biological signal of interest. Batch effects are often disregarded in association studies because there is often no statistically significant interaction between the scanning site and the variables being analyzed. In this work, we highlight important differences between association studies and prediction studies, and we argue that in the latter, batch effects matter. Our experiments reveal that not taking them into account reduces the performance of a learned classifier compared to using data from a single scanning site, even though restricting training to a single site drastically reduces the size of the training set. In addition, we can create a classifier that distinguishes among sites (rather than case vs. control) with an accuracy above 80%. We empirically show that if the same subjects are scanned at two different sites, then a neural network that maps the fMRI scan from one scanner to the other is enough to correct the batch effects.
In more realistic situations, involving disjoint sets of subjects, simple techniques like z-score normalization or whitening can remove batch effects caused by translations and scaling, or by translations and rotations, of the feature matrix, respectively. Both approaches proved successful in reducing the accuracy of scanning site classification to near chance level, but they were unable to improve the accuracy of schizophrenia diagnosis using multisite data. This is a strong indication that batch effects go beyond these simple linear transformations. Finally, we explored the use of BECCA (batch effects correction using canonical correlation analysis) and approaches based on autoencoders for decreasing the influence of batch effects. These attempts were also unsuccessful under our test scenarios, suggesting that batch effects are a serious problem in prediction studies using fMRI data, and that more effort is needed to understand their nature in order to reduce their influence.
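As a rough illustration of the per-site z-score normalization mentioned in the abstract, the following minimal sketch standardizes each feature within each scanning site. It is not taken from the thesis: the function name `zscore_per_site`, the list-of-rows data layout, and the toy values are all assumptions made for illustration. It shows only how a site-specific translation and scaling can be removed; the abstract reports that this kind of correction did not improve diagnostic accuracy.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_site(features, sites):
    """Standardize each feature column separately within each scanning site.

    features: list of equal-length rows (one row per subject); hypothetical layout.
    sites:    list of site labels, one per row.
    Returns a new list of rows; the input is not modified.
    """
    # Group row indices by site so that statistics are computed per site.
    rows_by_site = defaultdict(list)
    for idx, site in enumerate(sites):
        rows_by_site[site].append(idx)

    normalized = [None] * len(features)
    for idxs in rows_by_site.values():
        # Transpose this site's rows into per-feature columns.
        columns = list(zip(*(features[i] for i in idxs)))
        means = [mean(col) for col in columns]
        # A constant column has zero spread; divide by 1.0 so it is only centered.
        stds = [pstdev(col) or 1.0 for col in columns]
        for i in idxs:
            normalized[i] = [
                (x - m) / s for x, m, s in zip(features[i], means, stds)
            ]
    return normalized


# Toy data (made up): site "B" measures a different range than site "A".
toy_features = [[1.0, 2.0], [3.0, 6.0], [12.0, 14.0], [16.0, 22.0]]
toy_sites = ["A", "A", "B", "B"]
toy_normalized = zscore_per_site(toy_features, toy_sites)
```

After this step every feature has zero mean and unit variance within each site, so a per-site offset or scale factor can no longer separate the sites. Differences between sites that are not a per-feature shift or rescaling are left untouched.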
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
File Details

File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 2832978
Last modified: 2017-06-13 12:17:00-06:00
Filename: VegaRomero_Roberto_I_201701_MSc.pdf
Original checksum: 45fd1e9228e9eca9249dfbaa82ff7426
Page count: 89