Assessing the Feasibility of Learning Biomedical Phenotype Patterns Using High-Throughput Omics Profiles

Hajiloo, Mohsen

doi:10.7939/R3VQ2SH6R

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Hajiloo, Mohsen

A decade after the completion of the Human Genome Project, rapid advances in high-throughput measurement technologies have made omics (genomics, epigenomics, transcriptomics, metabolomics) profiling feasible. The availability of such omics profiles has raised hopes for more accurate disease models that will help improve existing clinical strategies for disease prevention, diagnosis, prognosis, and treatment. Revealing the hidden patterns of diseases from high-throughput omics profiles is feasible only if we choose appropriate informatics techniques. While basic univariate statistical analysis techniques are applicable to some extent within the reductionist paradigm of disease studies, supervised machine learning techniques are relevant in the systems biology paradigm. This dissertation uses such machine learning techniques and foundations to analyze, experimentally and analytically, the feasibility of learning breast cancer susceptibility and ancestral origins from a genome-wide scan of single nucleotide polymorphisms.

In the former task, using a dataset from Alberta with 696 samples (348 breast cancer cases and 348 controls) over 900K features, we achieved 59.55% leave-one-out cross-validation accuracy in breast cancer susceptibility prediction after examining a wide range of supervised learning methods. In the latter task, using the International HapMap Project phase II and III datasets, with hundreds of samples of different continental and subcontinental ancestral origins over 900K or 1450K features, we developed a novel learning method, ETHNOPRED, that achieved over 90% 10-fold cross-validation accuracy in various continental and subcontinental population identification problems.

Our sample complexity analysis (in the probably approximately correct learning framework) suggests that the ancestral origin prediction task is a case of realizable learning with many irrelevant features, and so requires only a relatively small number of instances. The breast cancer prediction task, by contrast, appears to be a case of unrealizable learning with relevant hidden features and hidden subclasses. This explains why it requires a large number of instances to be learned effectively, which we suspect is why the results for that task were weaker.
Graduation date

Spring 2014
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3VQ2SH6R
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Damaraju, Sambasivarao (Laboratory Medicine and Pathology)
- Greiner, Russell (Computing Science)
Examining committee members and their departments
- Schuurmans, Dale (Computing Science)
- Stothard, Paul (Agricultural, Food and Nutritional Science)
- Jurisica, Igor (Computer Science, University of Toronto)
- Wishart, David (Computing Science)