Statistical Analysis of Genomic Assays in Complex Study Designs

  • Author / Creator
    Khodayari Moez, Elham
  • Human genomic data are being generated at an increasing rate owing to advances in high-throughput technology. The wider availability of genomics, transcriptomics, proteomics and metabolomics data has motivated complex study questions intended to deepen the understanding of systems biology. These study questions have inspired novel study designs and demand rigorous statistical analysis. Although beneficial for understanding disease progression, the recently proposed directions of integrative and longitudinal analysis of multiple omics call for advanced statistical methods. Phenotypes are determined not merely by the presence of a single gene or a few genes, but by the interconnection of many genes and their downstream pathways. Integrative analysis of omics may reveal the regulation of the human genome at multiple levels and help establish personalized clinical practices. In our study of prostate cancer, tumor and healthy samples showed differential interdependency between oncogene expression (MYC and AKT1) and metabolite pathways. We showed that classic statistical approaches cannot deal with this complex design and offered the Linear Combination Test (LCT) as a solution for linking genomics and metabolomics, working directly with multiple continuous and correlated measurements. Although it offers insight into the temporal progression of disease and provides more accurate data, the longitudinal design of genetic studies remains out of reach for scientists because of the lack of adequate statistical methods that account for within-subject correlation. In this thesis, a Longitudinal Linear Combination Test (LLCT), a self-contained gene set analysis method, is proposed to detect genes that are differentially expressed in association with different trajectories of one or multiple phenotypes. LLCT is a high-dimensional data analysis method applicable to a wide range of longitudinal omics data.
It allows adjusting for potentially time-dependent covariates and works well with unbalanced and incomplete data. An extension of LLCT is applicable to family-based data, which have an additional layer of correlation between subjects. Simulation studies demonstrated that LLCT performs reasonably across different sample sizes, gene set sizes, numbers of follow-up visits, within-gene-set correlations and within-subject correlations, and that it outperforms other methods. The application study illustrated the adequacy of LLCT for detecting genes whose differential expression significantly alters the dynamics of blood pressure in related and unrelated datasets. We also proposed a generalization of LLCT that can handle time-course omics datasets. Efforts to investigate the genomic network may be wasted by poorly designed studies and inappropriate analytical tools. The success of genetic investigations depends on the development of comprehensive analysis methods appropriate for complex studies, designed to minimize potential errors and biases in the hope of achieving greater consistency among study findings.

  • Degree
    Doctor of Philosophy
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.