Clustering Tandem Mass Spectra for Better Peptide Identification

  • Author / Creator
    Panda, Megha
  • Abstract
    Identifying the peptide sequence from a mass spectrum is done either by database search or by de novo peptide sequencing. This thesis focuses on peptide identification by database search, a process in which an MS/MS spectrum is searched against an entire database of spectra representing peptides of known proteins in order to find an exact match or a match with the spectrum of a homologous peptide. In a mass spectrometry experiment, a database search faces two notable challenges: the large size of the sequence database (the search space) and the volume of mass spectra generated during an experiment (millions of spectra), each of which has to be searched against the database. The outcome of a database search depends on the quality of the spectrum being searched and on whether the spectrum of the corresponding peptide sequence is in the target database.

    One way to address these challenges is to use clustering as a preprocessing step. Previous literature has shown that, for a mass spectrometry experiment, clustering decreases the time taken to perform a database search and increases the number of acceptable identifications of mass spectra. Clustering reduces the number of spectra undergoing database search by replacing a large number of MS/MS spectra with a smaller number of cluster representatives. It also boosts the signal-to-noise ratio (SNR), leading to the identification of one strong spectrum rather than many unidentified weak spectra.
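    The idea of replacing many spectra with a few cluster representatives can be illustrated with a minimal sketch. It is not the method developed in the thesis: the greedy cosine-similarity grouping, the 1 Da binning, the 0.7 threshold, and all function names (`bin_spectrum`, `greedy_cluster`, `consensus`) are assumptions chosen only for illustration, and the peak lists are made-up toy data.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (sparse dict)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(spectra, threshold=0.7):
    """Put each spectrum into the first cluster whose first member
    is at least `threshold` similar; otherwise start a new cluster."""
    clusters = []
    for peaks in spectra:
        vec = bin_spectrum(peaks)
        for members in clusters:
            if cosine(vec, members[0]) >= threshold:
                members.append(vec)
                break
        else:
            clusters.append([vec])
    return clusters


def consensus(members):
    """Average the binned intensities of all spectra in one cluster."""
    total = defaultdict(float)
    for vec in members:
        for k, v in vec.items():
            total[k] += v
    return {k: v / len(members) for k, v in total.items()}


# Toy data: the first two spectra share their main peaks; the second one
# also carries a weak extra peak near m/z 150 that plays the role of noise.
spectra = [
    [(100.2, 10.0), (200.5, 50.0), (300.1, 30.0)],
    [(100.4, 12.0), (200.7, 45.0), (300.3, 28.0), (150.0, 3.0)],
    [(120.0, 40.0), (250.0, 20.0), (400.0, 35.0)],
]

clusters = greedy_cluster(spectra)
representatives = [consensus(c) for c in clusters]
print(len(spectra), "spectra ->", len(representatives), "representatives")
```

    In this toy setting only the representatives would be submitted to the database search. Averaging keeps the peaks shared by all cluster members near their original intensity while halving the peak that occurs in only one member, which is the intuition behind the SNR gain described above.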

    In this dissertation, we apply various clustering techniques to data obtained from tandem mass spectrometry and study how they affect the number of acceptable peptide identifications. To improve peptide identification over previous work, we propose a new way to extract clusters from HDBSCAN* hierarchies. We experimentally show that this approach outperforms previous work in this area and performs comparably with other clustering techniques from the data mining literature.

    We also study well-known cluster validation techniques to identify good parameter values for the different clustering algorithms and show that these approaches, unfortunately, do not work well in the context of peptide identification.

  • Subjects / Keywords
  • Graduation date
    Fall 2018
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.