Prediction and Characterization of DNA and RNA Binding Residues from Protein Sequence: state-of-the-art, novel predictors and proteome-scale analysis

Yan, Jing

doi:10.7939/R3TQ5RQ8T

Author / Creator

Yan, Jing
Abstract

Interactions between proteins and DNA/RNA play vital roles in many cellular processes, yet many of them remain to be found and characterized. Many computational methods have been developed to predict from protein sequences which parts of the proteins (the so-called interacting residues) are involved in these interactions. These methods can be used to find protein-RNA and protein-DNA interactions for the vast number of uncharacterized proteins. We review a comprehensive set of 30 such computational methods and summarize them from several perspectives, including their design, outputs and availability. We also empirically assess the subset of these methods that offer webservers, using a new benchmark dataset with more complete annotation of interactions than the existing datasets. We show that the predictors of DNA-binding (RNA-binding) residues offer relatively strong predictive performance but are unable to properly separate DNA-binding from RNA-binding residues. This substantial weakness motivates our research. Because the existing methods vary substantially in their architectures and predictions, they can be combined into consensuses that may offer better predictive performance than the individual methods. We design and empirically assess several types of consensuses and demonstrate that machine learning (ML)-based consensuses provide improved predictive performance. We also formulate and execute a first-of-its-kind study that targets the combined prediction of DNA- and RNA-binding residues, with the goal of substantially reducing cross-predictions between DNA- and RNA-binding residues. We design and test three types of these novel consensuses and conclude that the ML-based approach provides better predictive quality than the individual predictors and substantially improves discrimination between the two types of nucleic acids.
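To make the notions of a consensus and of cross-prediction concrete, the following is a minimal sketch and not the thesis's method. It uses a plain majority vote rather than the ML-based consensus described above. The predictor names, the toy predictions, and the definition of the cross-prediction rate (the fraction of native RNA-binding residues labeled as DNA-binding) are all assumptions made for illustration.

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]]) -> List[int]:
    """Label a residue as binding when more than half of the predictors do."""
    methods = list(predictions.values())
    length = len(methods[0])
    if any(len(p) != length for p in methods):
        raise ValueError("all predictors must cover the same residues")
    threshold = len(methods) / 2
    return [1 if sum(p[i] for p in methods) > threshold else 0
            for i in range(length)]


def cross_prediction_rate(dna_pred: List[int], rna_native: List[int]) -> float:
    """Assumed definition: share of native RNA-binding residues that are
    predicted as DNA-binding."""
    rna_positions = [i for i, label in enumerate(rna_native) if label == 1]
    if not rna_positions:
        return 0.0
    return sum(dna_pred[i] for i in rna_positions) / len(rna_positions)


# Hypothetical binary outputs of three DNA-binding predictors
# for a single 8-residue protein (1 = predicted DNA-binding).
dna_predictions = {
    "method_a": [1, 1, 0, 0, 1, 0, 0, 1],
    "method_b": [1, 0, 0, 0, 1, 1, 0, 1],
    "method_c": [0, 1, 0, 0, 1, 1, 0, 0],
}
# Hypothetical native RNA-binding annotation for the same protein.
rna_native = [0, 0, 0, 0, 0, 1, 1, 0]

consensus = majority_vote(dna_predictions)
rate = cross_prediction_rate(consensus, rna_native)
print(consensus, rate)
```

In this toy case the vote labels one of the two native RNA-binding residues as DNA-binding, which is the kind of error the combined DNA/RNA consensuses aim to reduce.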
Although this consensus is the only available solution to the cross-prediction problem, it is hard to use and time-consuming to execute, because it relies on predictions from 8 methods that require long runtimes. To address this, we develop a novel high-throughput method, DRNApred, that accurately and specifically predicts only DNA-binding and only RNA-binding residues from protein sequences. DRNApred is built using a new dataset that contains both DNA- and RNA-binding proteins, a weight-based mechanism that penalizes cross-predictions, and a two-layered architecture. The predictions generated in both layers are based on logistic regression models constructed from a comprehensive set of sequence-derived information. We demonstrate that the novel design ideas used in DRNApred raise its predictive quality. DRNApred outperforms other representative state-of-the-art methods for the prediction of DNA- or RNA-binding residues. Empirical assessment on a test dataset shows that our method substantially reduces cross-predictions. The false positives predicted by DRNApred are of higher quality because they are located near the native binding residues. Moreover, DRNApred outperforms the other methods for the prediction of DNA- or RNA-binding proteins. Application to the human proteome confirms that DRNApred substantially reduces cross-predictions compared with BindN+, the only other existing method that is efficient enough to process such a large number of proteins. We show that the novel putative binding proteins predicted by DRNApred share similarities with the known annotated binding proteins, which indicates that DRNApred can be used to accurately discover novel DNA- and RNA-binding proteins in the human proteome.
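The idea of penalizing cross-predictions through weights can be illustrated with a sample-weighted logistic regression. This is a sketch under stated assumptions and not DRNApred itself: the abstract does not give the features, the weight values, or the details of the two layers, so the single input score, the penalty of 5.0, and all function names below are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_logreg(xs: List[float], ys: List[int], weights: List[float],
                        lr: float = 1.0, epochs: int = 5000) -> Tuple[float, float]:
    """Fit p(DNA-binding | x) = sigmoid(coef * x + bias) by gradient descent
    on the sample-weighted logistic loss."""
    coef, bias = 0.0, 0.0
    total = sum(weights)
    for _ in range(epochs):
        grad_coef, grad_bias = 0.0, 0.0
        for x, y, w in zip(xs, ys, weights):
            error = sigmoid(coef * x + bias) - y
            grad_coef += w * error * x
            grad_bias += w * error
        coef -= lr * grad_coef / total
        bias -= lr * grad_bias / total
    return coef, bias


def sample_weights(kinds: List[str], rna_penalty: float) -> List[float]:
    """Give RNA-binding residues (negatives for the DNA model) a larger weight."""
    return [rna_penalty if kind == "rna" else 1.0 for kind in kinds]


# Hypothetical one-dimensional sequence-derived score per residue.
scores = [1.0, 0.8, 0.9, 0.0, 0.1]
kinds = ["dna", "dna", "rna", "none", "none"]
labels = [1 if kind == "dna" else 0 for kind in kinds]

uniform = fit_weighted_logreg(scores, labels, sample_weights(kinds, 1.0))
penalized = fit_weighted_logreg(scores, labels, sample_weights(kinds, 5.0))

# Predicted DNA-binding probability for the RNA-binding residue (score 0.9).
p_uniform = sigmoid(uniform[0] * 0.9 + uniform[1])
p_penalized = sigmoid(penalized[0] * 0.9 + penalized[1])
print(p_uniform, p_penalized)
```

Because the RNA-binding residue counts more heavily as a negative in the penalized fit, the DNA model assigns it a lower DNA-binding probability than the uniformly weighted fit does. The cost is a weaker fit to the DNA-binding residues with similar scores.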
Subjects / Keywords
- RNA-binding prediction
- DNA-binding prediction
Graduation date

Fall 2016
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3TQ5RQ8T
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Electrical and Computer Engineering
Specialization
- Software Engineering and Intelligent Systems
Supervisor / co-supervisor and their department(s)
- Kurgan, Lukasz (Electrical and Computer Engineering)
- Reformat, Marek (Electrical and Computer Engineering)
Examining committee members and their departments
- Musilek, Petr (Electrical and Computer Engineering)
- Cheng, Jianlin (University of Missouri)
- Dick, Scott (Electrical and Computer Engineering)