This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Survival Prediction using Gene Expression Data - A Topic Modeling Approach Open Access


Keywords: Personalised survival prediction; High Dimensional Data; Gene expression; Topic Modeling; Survival Prediction; Latent Dirichlet Allocation (LDA)
Degree grantor: University of Alberta
Author or creator: Kumar, Luke N
Supervisor and department: Greiner, Russell (Computing Science)
Examining committee member and department: Schuurmans, Dale (Computing Science); Wishart, David (Computing Science, Biological Sciences)
Department: Department of Computing Science
Graduation date: 2017-06 (Spring 2017)
Degree: Master of Science
Survival prediction is becoming a crucial part of treatment planning for most terminally ill patients. Many believe that genomic data will enable better estimates of these patients' survival, leading to better, more personalized treatment options and patient care. Because standard survival prediction models cannot cope with the high dimensionality of gene expression data, many projects apply dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from high-dimensional gene expression data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic is a distribution over words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (document) as a mixture over "(cancer) strains" (topics), where each strain is a mixture over gene expression values (words). After using our novel discretized Latent Dirichlet Allocation (dLDA) procedure to learn these strains, we express each patient as a distribution over a small number of strains and use this distribution as input to a learning algorithm. We then run a recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We focus on the METABRIC dataset, which describes each of n=1,981 breast cancer patients using k=49,576 gene expression values. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than those of standard models, in terms of the standard Concordance Index as well as a relevant novel measure, D-calibration. We then validate this approach on the TCGA BRCA dataset (n=1,082 patients, k=20,532 gene expression values).
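As a rough illustration of two ideas in the abstract, the sketch below shows (1) how a patient's continuous expression values might be discretized into "words" so the patient can be treated as a document, and (2) how a Concordance Index can be computed from predicted risks. This is not the thesis's dLDA procedure or its evaluation code: the function names, the cut points, the gene labels, and the toy numbers are all assumptions made for illustration, and pairs with tied survival times are simply skipped.

```python
from collections import Counter
from itertools import combinations


def patient_to_document(expression, cut_points):
    """Turn one patient's expression values into a bag of 'words'.

    Each word is a gene-plus-bin token, so a patient reads like a document.
    The cut points are an illustrative assumption, not the thesis's scheme.
    """
    doc = Counter()
    for gene, value in expression.items():
        # Bin index = number of cut points the value exceeds.
        bin_index = sum(value > c for c in cut_points)
        doc[f"{gene}_bin{bin_index}"] += 1
    return doc


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose risks are ordered correctly.

    A pair is comparable when the patient with the shorter time had an
    observed event (events[i] == 1). A higher risk score should go with a
    shorter survival time; ties in risk count as half. Pairs with equal
    times are skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier patient was censored, so the order is unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Made-up expression values for three hypothetical gene labels.
    doc = patient_to_document(
        {"gene_a": -1.2, "gene_b": 0.1, "gene_c": 2.3},
        cut_points=(-0.5, 0.5),
    )
    print(dict(doc))

    # Made-up survival times, event indicators (0 = censored), and risks.
    c_index = concordance_index(
        times=[5, 10, 15, 20],
        events=[1, 1, 0, 1],
        risks=[0.9, 0.4, 0.5, 0.1],
    )
    print(round(c_index, 3))
```

In a pipeline like the one described, the token counts for all patients would form a patient-by-word matrix that a topic model could consume, and the resulting per-patient topic proportions would be the features passed to the survival model.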
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
File Details

File format: pdf (PDF/A)
Mime type: application/pdf
File size: 921960 bytes
Last modified: 2017-11-08 17:45:38-07:00
Filename: Kumar_Luke_N_201612_MSc.pdf
Original checksum: ea8839f03934cfa013fb4c230d2a393e
Well formed: true
Valid: true
Page count: 64