Clustering Tandem Mass Spectra for Better Peptide Identification

Panda, Megha

doi:10.7939/R3WW77G35

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

A peptide sequence is identified from a mass spectrum either by database search or by de novo peptide sequencing. This thesis focuses on database search, in which an MS/MS spectrum is searched against an entire database of spectra representing peptides of known proteins in order to find an exact match or a match with the spectrum of a homologous peptide. In a mass spectrometry experiment, a database search faces two notable challenges: the large size of the sequence database (the search space) and the volume of mass spectra generated during an experiment (millions of spectra), each of which has to be searched against the database. The output of a database search depends on the quality of the spectrum being searched and on whether the spectrum of the corresponding peptide sequence is present in the target database.

One way to address these challenges is to use clustering as a preprocessing step. Previous literature has shown that, for a mass spectrometry experiment, clustering decreases the time taken to perform a database search and increases the number of acceptable identifications of mass spectra. Clustering reduces the number of spectra undergoing database search by replacing a large number of MS/MS spectra with a smaller number of cluster representatives. It also boosts the signal-to-noise ratio (SNR), leading to the identification of one strong spectrum rather than many unidentified weak spectra.
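
To make the idea of cluster representatives concrete, here is a minimal sketch in Python. It is not the method used in the thesis: the greedy assignment rule, the cosine similarity on binned peaks, the 0.8 threshold, the 1.0 m/z bin width, and all function names are assumptions chosen only for illustration. It shows how two noisy copies of the same spectrum can collapse into one averaged representative, so that fewer spectra need to be searched.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a sparse binned vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_cluster(spectra, threshold=0.8, bin_width=1.0):
    """Put each spectrum into the first cluster whose summed vector is
    similar enough; otherwise start a new cluster (illustrative rule)."""
    clusters = []  # each cluster: {"members": [indices], "sum": binned vector}
    for idx, peaks in enumerate(spectra):
        vec = bin_spectrum(peaks, bin_width)
        for cluster in clusters:
            if cosine(vec, cluster["sum"]) >= threshold:
                cluster["members"].append(idx)
                for k, v in vec.items():
                    cluster["sum"][k] = cluster["sum"].get(k, 0.0) + v
                break
        else:  # no existing cluster was similar enough
            clusters.append({"members": [idx], "sum": dict(vec)})
    return clusters

def representative(cluster):
    """Average the binned intensities of all members of a cluster."""
    n = len(cluster["members"])
    return {k: v / n for k, v in cluster["sum"].items()}

# Toy data (made up): spectra 0 and 1 are noisy copies of one another,
# spectrum 2 has no peaks in common with them.
spectra = [
    [(100.1, 10.0), (200.2, 50.0), (300.3, 20.0)],
    [(100.2, 12.0), (200.1, 45.0), (300.4, 25.0), (150.0, 3.0)],
    [(120.0, 30.0), (250.5, 40.0), (410.9, 35.0)],
]
clusters = greedy_cluster(spectra)
representatives = [representative(c) for c in clusters]
```

In this toy run, three input spectra yield two representatives. Peaks shared by the members of a cluster keep their average intensity, while a peak seen in only one member (here the one near m/z 150) is halved, which is the intuition behind the signal-to-noise argument.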

In this dissertation, we apply various clustering techniques to data obtained from tandem mass spectrometry and study how they affect the number of acceptable peptide identifications. To improve peptide identification over previous work, we propose a new way to extract clusters from HDBSCAN* hierarchies. We show experimentally that this approach outperforms previous work in this area and performs comparably with other clustering techniques from the data mining literature.
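
This page does not describe the thesis's cluster-extraction method, so it is not reproduced here. As background only, the sketch below illustrates the mutual reachability distance on which HDBSCAN* hierarchies are built: the distance between two points is raised to at least the core distance of each. The toy 2-D points, the Euclidean metric, the function names, and the convention that a point counts as its own first neighbour are assumptions made for illustration. Real spectra would first need a vector representation and a suitable distance.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbour,
    counting the point itself as the first neighbour (assumed convention)."""
    cores = []
    for p in points:
        dists = sorted(euclidean(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores

def mutual_reachability(points, min_pts):
    """Matrix of max(core(a), core(b), d(a, b)) for every pair a != b."""
    cores = core_distances(points, min_pts)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], euclidean(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Toy data (made up): three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
cores = core_distances(points, min_pts=2)
mreach = mutual_reachability(points, min_pts=2)
```

The isolated point has a large core distance, so every mutual reachability distance involving it is large as well. A hierarchy built on these distances therefore separates sparse points from dense groups early, and clusters are then extracted from that hierarchy.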

We also study well-known cluster validation techniques to identify good parameter values for the different clustering algorithms and show that these approaches, unfortunately, do not work well in the context of peptide identification.
Subjects / Keywords
Graduation date

Fall 2018
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/R3WW77G35
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sander, Joerg (Computing Science)