




This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Assessing the Feasibility of Learning Biomedical Phenotype Patterns Using High-Throughput Omics Profiles Open Access


Other title
Computational Learning Theory
Breast Cancer
Machine Learning
Type of item
Degree grantor
University of Alberta
Author or creator
Hajiloo, Mohsen
Supervisor and department
Damaraju, Sambasivarao (Laboratory Medicine and Pathology)
Greiner, Russell (Computing Science)
Examining committee member and department
Jurisica, Igor (Computer Science, University of Toronto)
Stothard, Paul (Agricultural, Food and Nutritional Science)
Wishart, David (Computing Science)
Schuurmans, Dale (Computing Science)
Department of Computing Science

Date accepted
Graduation date
Doctor of Philosophy
Degree level
A decade after the completion of the Human Genome Project, the rapid advancement of high-throughput measurement technologies has made omics (genomics, epigenomics, transcriptomics, metabolomics) profiling feasible. The availability of such omics profiles has raised hope for the development of more accurate disease models that will help improve existing clinical strategies for disease prevention, diagnosis, prognosis, and treatment. Revealing the hidden patterns of diseases from high-throughput omics profiles is feasible only if we choose appropriate informatics techniques. While basic univariate statistical analysis techniques are applicable to some extent within the reductionist paradigm of disease studies, supervised machine learning techniques are relevant in the systems biology paradigm of disease studies. This dissertation uses such machine learning techniques and foundations to analyze, experimentally and analytically, the feasibility of learning breast cancer susceptibility and ancestral origins from a genome-wide scan of single nucleotide polymorphisms. In the former task, using a dataset from Alberta with 696 samples (348 breast cancer cases and 348 controls) over 900K features, we achieved 59.55% leave-one-out cross-validation accuracy in breast cancer susceptibility prediction, after examining a wide range of supervised learning methods. In the latter task, using the International HapMap Project phase II and III datasets, with hundreds of samples of different continental and subcontinental ancestral origins over 900K or 1450K features, we developed a novel learning method, ETHNOPRED, that achieved over 90% 10-fold cross-validation accuracy in various continental and subcontinental population identification problems.
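As an illustration of the leave-one-out cross-validation protocol mentioned above, here is a minimal sketch. The nearest-centroid classifier, the function names, and the four toy genotype vectors (SNPs coded 0/1/2) are all assumptions made for this example; they are not the dissertation's methods or data.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Hypothetical stand-in classifier: predict the label whose mean
    feature vector (centroid) is closest to the query point."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / counts[label] for s in sums[label]]
        # Squared Euclidean distance from the query to this centroid.
        dist = sum((c - q) ** 2 for c, q in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys, predict):
    """Hold out each sample once, train on the rest, and report the
    fraction of held-out samples that were predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data (invented): four samples, three SNPs coded as 0, 1, or 2.
toy_x = [[0, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 2]]
toy_y = [0, 0, 1, 1]
toy_accuracy = leave_one_out_accuracy(toy_x, toy_y, nearest_centroid_predict)
```

With n samples the model is retrained n times, so on a dataset of several hundred samples and roughly 900K features the cost of each training run dominates the cost of the protocol.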
Our sample complexity analysis (in the probably approximately correct learning framework) suggests that the ancestral origin prediction task is a case of realizable learning with many irrelevant features, and so requires only a relatively small number of instances. The breast cancer prediction task, in contrast, appears to be a case of unrealizable learning with relevant hidden features and hidden subclasses. This explains why it requires a large number of instances to be learned effectively, which we suspect is why the results here were not as good.
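The gap between the realizable and unrealizable cases can be illustrated with the standard textbook sample complexity bounds for a finite hypothesis class H. This is a minimal sketch, not the dissertation's own analysis: the function names are hypothetical, and the hypothesis class (conjunctions of k literals over n binary features, crudely counted as (2n)^k) and the values of n, k, epsilon, and delta are assumptions chosen only for illustration.

```python
import math


def realizable_sample_bound(ln_num_hypotheses, epsilon, delta):
    """Realizable case: a learner that outputs a hypothesis consistent
    with the training data has true error <= epsilon with probability
    >= 1 - delta once m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((ln_num_hypotheses + math.log(1.0 / delta)) / epsilon)


def agnostic_sample_bound(ln_num_hypotheses, epsilon, delta):
    """Unrealizable (agnostic) case: by Hoeffding's inequality and a
    union bound, every hypothesis has training error within epsilon of
    its true error with probability >= 1 - delta once
    m >= (ln|H| + ln(2/delta)) / (2 * epsilon**2)."""
    return math.ceil(
        (ln_num_hypotheses + math.log(2.0 / delta)) / (2.0 * epsilon ** 2)
    )


# Illustrative assumptions (not taken from the dissertation):
n_features = 900_000   # order of magnitude of the SNP feature count
k_literals = 10        # hypothetical number of relevant features
# Crude upper bound on ln|H| for k-literal conjunctions: ln((2n)^k).
ln_h = k_literals * math.log(2 * n_features)

m_realizable = realizable_sample_bound(ln_h, epsilon=0.1, delta=0.05)
m_agnostic = agnostic_sample_bound(ln_h, epsilon=0.1, delta=0.05)
```

The realizable bound grows as 1/epsilon while the agnostic bound grows as 1/epsilon^2, and both depend on the number of features only through ln|H|. This is consistent with the observation that many irrelevant features need not make a realizable task expensive to learn, whereas an unrealizable task needs substantially more instances for the same accuracy target.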
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication
Hajiloo M, Damavandi B, Hooshsadat M, Sangi F, Cass CE, Mackey JR, Greiner R, Damaraju S: Using genome wide single nucleotide polymorphism data to learn a model for breast cancer prediction. BMC Bioinformatics 2013, 14(S13): S3.
Hajiloo M, Sapkota Y, Mackey JR, Robson P, Greiner R, Damaraju S: ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction. BMC Bioinformatics 2013, 14(1): 61.
Hajiloo M: Learning disease patterns from novel high-throughput genomics profiles: why is it so challenging? Lecture Notes in Computer Science 2013, 7884: 328-333.
Hajiloo M, Greiner R: Assessing the feasibility of learning biomedical phenotypes via large scale omics profiles. NIPS Workshop on Machine Learning in Computational Biology (NIPS MLCB), Lake Tahoe, USA, December 2013.

File Details

File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 1495330
Last modified: 2015-10-18 01:35:12-06:00
Filename: Hajiloo_Mohsen_Spring 2014.pdf
Original checksum: 06f5bde89da17c907d80a61f1e38a16d
Well formed: true
Valid: true
File title: Microsoft Word - Mohsen Hajiloo_PhD Dissertation_Final Version_27Dec2013
File author: Mohsen
Page count: 125