Assessing the Feasibility of Learning Biomedical Phenotype Patterns Using High-Throughput Omics Profiles

  • Author / Creator
    Hajiloo, Mohsen
  • A decade after the completion of the Human Genome Project, rapid advances in high-throughput measurement technologies have made omics (genomics, epigenomics, transcriptomics, metabolomics) profiling feasible. The availability of such omics profiles has raised hopes for more accurate disease models that will help improve existing clinical strategies for disease prevention, diagnosis, prognosis, and treatment. Revealing the hidden patterns of diseases from high-throughput omics profiles is feasible only if we choose appropriate informatics techniques. While basic univariate statistical analysis techniques are applicable to some extent within the reductionist paradigm of disease studies, supervised machine learning techniques are relevant in the systems biology paradigm. This dissertation uses such machine learning techniques and foundations to analyze, experimentally and analytically, the feasibility of learning breast cancer susceptibility and ancestral origin from a genome-wide scan of single nucleotide polymorphisms. In the former task, using a dataset from Alberta with 696 samples (348 breast cancer cases and 348 controls) and over 900K features, we achieved 59.55% leave-one-out cross-validation accuracy in breast cancer susceptibility prediction after examining a wide range of supervised learning methods. In the latter task, using the International HapMap Project phase II and III datasets, which contain hundreds of samples of different continental and subcontinental ancestral origins with over 900K or 1450K features, we developed a novel learning method, ETHNOPRED, that achieved over 90% 10-fold cross-validation accuracy in various continental and subcontinental population identification problems.
Our sample complexity analysis (in the probably approximately correct learning framework) suggests that the ancestral origin prediction task is a case of realizable learning with many irrelevant features and therefore requires only a relatively small number of instances. The breast cancer prediction task, by contrast, appears to be a case of unrealizable learning with relevant hidden features and hidden subclasses, so it requires a large number of instances to be learned effectively; we suspect this is why the results for that task were not as good.
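
The contrast between realizable and unrealizable learning can be illustrated with the standard textbook sample-complexity bounds for a finite hypothesis class. This is a generic sketch, not the analysis carried out in the dissertation; the symbols $H$ (hypothesis class), $\epsilon$ (error tolerance), $\delta$ (failure probability), and $m$ (number of training instances) are assumptions introduced here for illustration.

```latex
% Realizable case: some h in H has zero true error.
% With probability at least 1 - \delta, every hypothesis consistent with
% the m training instances has true error at most \epsilon whenever
\[
  m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
\]

% Unrealizable (agnostic) case: no h in H need be perfect.
% By Hoeffding's inequality and a union bound over H, with probability
% at least 1 - \delta the empirical error of every h in H is within
% \epsilon of its true error whenever
\[
  m \;\ge\; \frac{1}{2\epsilon^{2}}\left(\ln|H| + \ln\frac{2}{\delta}\right).
\]
```

The dependence on the tolerance changes from $1/\epsilon$ to $1/\epsilon^{2}$, so for small $\epsilon$ the unrealizable setting may call for far more instances than the realizable one, which is consistent with the qualitative conclusion stated above.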

  • Degree
    Doctor of Philosophy
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.