
Application of Machine Learning Towards Compound Identification through Gas Chromatography Retention Index (RI) and Electron Ionization Mass Spectrometry (EI-MS) Predictions

  • Author / Creator
    Anjum, Afia
  • With over 100 million synthetic chemicals and over 1 million biologically derived compounds known to humans, chemists face significant challenges in identifying and characterizing them. In addition to this large collection of known compounds, analytical chemists, natural product chemists, pharmacologists, and toxicologists are frequently confronted with unknown substances, which may arise from biotic, abiotic or spontaneous chemical reactions. Gas chromatography-mass spectrometry (GC-MS) is frequently used to identify many of these known and unknown compounds. Key to compound identification via GC-MS is the accurate and reliable measurement of retention times (which are typically normalized to a retention index, or RI) and electron ionization mass spectra (EI-MS). Once RIs and EI-MS have been measured, these values can be compared against reference RI and EI-MS tables or libraries to identify the compound. However, this process can be time-consuming, labor-intensive and error-prone. Moreover, existing libraries of RI and EI-MS data are often inadequate, covering <1% of known compounds and, by definition, no unknown compounds. This makes compound identification by GC-MS quite daunting. Computational techniques, particularly those involving machine learning (ML), can enhance the compound identification process by predicting RI and EI-MS data using only text representations of chemical structures. Among known ML methods, the most promising results for RI and EI-MS prediction have been achieved using variants of graph neural networks (GNNs). Most existing RI predictors are not publicly available. Furthermore, they incorporate neither GC column phase nor derivatization type information, which severely limits their utility. Existing EI-MS predictors tend to suffer from either a lack of peak annotations or inaccurate peak intensity prediction.
In this thesis, I first describe my efforts to develop a GNN-based, freely available web server for RI prediction, called RIpred (https://ripred.ca), that rapidly and accurately predicts GC Kováts retention indices using SMILES strings as the only chemical structure input. RIpred performs RI prediction for three different GC stationary phases, for both trimethylsilyl (TMS) derivatized and underivatized (base) compounds. The best-performing RIpred model, when tested on hold-out test sets from all stationary phases, achieved a mean absolute percentage error (MAPE) within 3%. Second, I describe my efforts to develop a GNN-based EI-MS predictor (EI-MSpred) that accepts SMILES strings and generates an EI-MS spectrum. This predictor, which is based on a previously published model called NEIMS, uses a molecular ion intensity predictor (MIIP) and a peak annotation program (called PeakAnnotator) to improve its performance. EI-MSpred, when tested on a hold-out test set comprising ~2000 molecules from the NIST23 library, achieved a spectral matching dot product score of 0.621. In terms of spectral annotation correctness and coverage, EI-MSpred significantly outperformed other existing EI-MS predictors, achieving an average correctness of 91% and an average coverage of 94% on two hold-out test sets containing six common compounds and five random NIST23 compounds, respectively. Finally, I present evidence showing the effectiveness of combining RIpred and EI-MSpred (together, EI-RIpred) to aid compound identification by simulating three GC-MS compound identification experiments.
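
For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal Python sketch of how a mean absolute percentage error and a dot product spectral matching score are commonly computed. It is an illustration only, not the thesis's code: the function names `mape` and `dot_product_score` are hypothetical, the score shown is the plain unweighted cosine form (the thesis may use a variant weighted by m/z and intensity), and the numbers in the usage lines are made up.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes all reference values are non-zero, which holds for retention indices.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def dot_product_score(spec_a, spec_b):
    """Unweighted normalized dot product (cosine) between two spectra.

    Each spectrum is a dict mapping an integer m/z value to a peak intensity.
    Returns a value in [0, 1] for non-negative intensities.
    """
    dot = sum(inten * spec_b.get(mz, 0.0) for mz, inten in spec_a.items())
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Illustrative values only (not data from the thesis).
measured_ri = [1000.0, 2000.0]
predicted_ri = [1010.0, 1960.0]
ri_error = mape(measured_ri, predicted_ri)

reference = {43: 100.0, 57: 80.0, 71: 40.0}
predicted = {43: 90.0, 57: 85.0, 85: 10.0}
similarity = dot_product_score(reference, predicted)
```

Under this definition, a score of 1.0 means the predicted and reference spectra have identical relative peak intensities, and 0.0 means they share no peaks.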

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-a57x-6m59
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.