Some Bioinformatics Studies on SARS-CoV-2

  • Author / Creator
    Mitra, Sangita
  • Abstract
    The ongoing COVID-19 pandemic is affecting the lives of billions of people worldwide, as well as medical and socioeconomic systems. The genomic variability of SARS-CoV-2 allows it to remain prevalent in humans around the world for a long time and to spread from one place to another. Understanding the trends of SARS-CoV-2 requires a detailed study of its molecular epidemiology, evolutionary models, and phylogenetics. In this dissertation, we perform several bioinformatics studies on coronaviruses and SARS-CoV-2, focusing on their evolution. We perform time series analysis of mutations in the spike, membrane, and envelope proteins of SARS-CoV-2 to understand how they evolve over time. The spike protein plays a vital role in binding to the human ACE2 receptor. We investigate the implication, co-occurrence, and recurrence of spike mutations. The D614G mutation increases infectivity, and our implication analysis found that, 98% of the time, when the D614G mutation occurs, 28 other mutations also occur in the spike protein. We found several recurrent mutation pairs in the spike protein that appeared periodically. We analyze the relationship of SARS-CoV-2 to two previous outbreaks, SARS-CoV and MERS-CoV, in terms of the time series of spike protein mutations. We analyze the mutation rates of six variants of interest and variants of concern to understand how the number of mutations changes over time. We observed that the COVID-19 pandemic follows certain time-series patterns and therefore applied forecasting to predict upcoming mutations. To this end, an encoder-decoder long short-term memory (LSTM) model is applied to predict nucleotide mutations and spike protein mutations at certain positions of SARS-CoV-2. We propose two bootstrapping techniques as statistical tests to evaluate the model’s overall performance and its prediction at each mutation site. The statistical tests show that our model is highly robust in prediction at most sites despite missing data. The results show that the forecasting is more confident at some biologically significant sites than at less significant ones.
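
The implication analysis mentioned in the abstract can be read as a conditional-frequency question: among sequences carrying one mutation, how often does another mutation also appear? The sketch below illustrates that reading only. The function name `implication_confidence`, the toy `samples`, and the set-of-mutation-labels representation are assumptions made for illustration; the thesis may define and compute implication differently.

```python
from typing import Dict, List, Set


def implication_confidence(samples: List[Set[str]], antecedent: str) -> Dict[str, float]:
    """Estimate P(other mutation present | antecedent present) from samples.

    Each sample is the set of mutation labels observed in one sequence.
    This is a hypothetical helper, not the method used in the thesis.
    """
    # Keep only the sequences that carry the antecedent mutation.
    with_antecedent = [s for s in samples if antecedent in s]
    if not with_antecedent:
        return {}

    # Count how often every other mutation co-occurs with the antecedent.
    counts: Dict[str, int] = {}
    for sample in with_antecedent:
        for mutation in sample:
            if mutation != antecedent:
                counts[mutation] = counts.get(mutation, 0) + 1

    total = len(with_antecedent)
    return {mutation: n / total for mutation, n in counts.items()}


# Toy data, invented for illustration only (not results from the thesis).
samples = [
    {"D614G", "N501Y"},
    {"D614G", "N501Y", "P681H"},
    {"D614G"},
    {"A222V"},
]
confidence = implication_confidence(samples, "D614G")
# Mutations whose conditional frequency meets a chosen threshold.
implied = sorted(m for m, c in confidence.items() if c >= 0.5)
```

Under this reading, a statement such as "98% of the time" would correspond to applying a threshold of 0.98 to these conditional frequencies.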

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.