Prediction and Characterization of DNA and RNA Binding Residues from Protein Sequence: state-of-the-art, novel predictors and proteome-scale analysis

  • Author / Creator
  • Interactions between proteins and DNA/RNA play vital roles in many cellular processes, yet many of them remain to be found and characterized. Many computational methods have been developed to predict from protein sequences which parts of the proteins (so-called interacting residues) are involved in these interactions. These methods can be used to find protein-RNA and protein-DNA interactions for the vast number of uncharacterized proteins. We review a comprehensive set of 30 such computational methods and summarize them from several perspectives, including their design, outputs and availability. We also empirically assess the subset of these methods that offer webservers, using a new benchmark dataset whose annotation of interactions is more complete than that of the existing datasets. We show that the predictors of DNA-binding (RNA-binding) residues offer relatively strong predictive performance but are unable to properly separate DNA-binding from RNA-binding residues. This substantial weakness motivates our research. Since the existing methods vary substantially in their architectures and predictions, they can be combined to build consensuses that may offer improved predictive performance compared to the individual methods. We design and empirically assess several types of consensuses and demonstrate that machine learning (ML)-based consensuses provide improved predictive performance. We also formulate and execute a first-of-its-kind study that targets the combined prediction of DNA- and RNA-binding residues, with the goal of substantially reducing the cross-predictions between DNA- and RNA-binding residues. We design and test three types of these novel consensuses and conclude that the ML-based design provides better predictive quality than the individual predictors and substantially improves discrimination between the two types of nucleic acids.
Although it is the only available solution to the cross-prediction problem, this consensus is hard to use and time-consuming to execute, because it relies on the predictions from 8 methods that require long runtimes. To address this, we develop a novel high-throughput method, DRNApred, that accurately and specifically predicts only DNA-binding and only RNA-binding residues from protein sequences. DRNApred is built using a new dataset that includes both DNA- and RNA-binding proteins, a weight-based mechanism that penalizes cross-predictions, and a two-layered architecture. The predictions generated in both layers are based on logistic regression models constructed using a comprehensive set of sequence-derived information. We demonstrate that the novel design ideas utilized in DRNApred raise its predictive quality. DRNApred outperforms the other representative state-of-the-art methods for the prediction of DNA- or RNA-binding residues. Based on an empirical assessment on a test dataset, we show that our method substantially reduces the cross-predictions. The false positives predicted by DRNApred are of higher quality, since they are located near the native binding residues. Moreover, DRNApred outperforms the other methods for the prediction of DNA- or RNA-binding proteins. Application to the human proteome confirms that DRNApred outperforms BindN+, the only other runtime-efficient existing method that can process such a large number of proteins, by substantially reducing the cross-predictions. We show that the novel putative binding proteins predicted by DRNApred share similarities with the known annotated binding proteins, indicating that DRNApred can be used to accurately discover novel DNA- and RNA-binding proteins in humans.
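The weighting and two-layer ideas described in the abstract can be illustrated with a minimal sketch. It is not the DRNApred implementation: the abstract does not give the weighting scheme, the layer wiring, or the features, so everything below is an assumption made for illustration. In particular, the `penalty` value, the choice to feed both first-layer scores into the second layer, the function names, and the two synthetic features are all hypothetical. The sketch fits logistic regression by gradient descent on a weighted log-loss, where residues that bind the other nucleic acid type receive a larger weight so that cross-predictions cost more during training.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    coef, bias = model
    return sigmoid(bias + sum(c * v for c, v in zip(coef, x)))

def fit_weighted_logreg(X, y, w, epochs=300, lr=0.5):
    """Batch gradient descent on a per-example weighted log-loss."""
    coef, bias, total = [0.0] * len(X[0]), 0.0, sum(w)
    for _ in range(epochs):
        errs = [wi * (score((coef, bias), xi) - yi) for xi, yi, wi in zip(X, y, w)]
        bias -= lr * sum(errs) / total
        coef = [c - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / total
                for j, c in enumerate(coef)]
    return coef, bias

def fit_layer(X, labels, penalty):
    """One model per nucleic acid type (assumed design, not from the thesis)."""
    models = {}
    for target, other in (("DNA", "RNA"), ("RNA", "DNA")):
        y = [1.0 if lab == target else 0.0 for lab in labels]
        # Assumed weighting: residues binding the other type count `penalty`
        # times, so scoring them as `target` is punished more heavily.
        w = [penalty if lab == other else 1.0 for lab in labels]
        models[target] = fit_weighted_logreg(X, y, w)
    return models

def apply_layer(models, X):
    """Return one (DNA score, RNA score) pair per residue."""
    return [(score(models["DNA"], x), score(models["RNA"], x)) for x in X]

def train_two_layer(X, labels, penalty=5.0):
    # Assumed wiring: layer 2 is trained on the two layer-1 scores.
    layer1 = fit_layer(X, labels, penalty)
    layer2 = fit_layer(apply_layer(layer1, X), labels, penalty)
    return layer1, layer2

def predict_two_layer(layers, X):
    layer1, layer2 = layers
    return apply_layer(layer2, apply_layer(layer1, X))

def make_toy_data(n_per_class=50, seed=0):
    """Synthetic 2-feature residues; NOT real sequence-derived features."""
    rng = random.Random(seed)
    centers = {"DNA": (2.0, 0.0), "RNA": (0.0, 2.0), "none": (0.0, 0.0)}
    X, labels = [], []
    for lab, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
            labels.append(lab)
    return X, labels

def mean_score(scores, labels, column, group):
    """Average of one score column (0 = DNA, 1 = RNA) over one residue group."""
    vals = [s[column] for s, lab in zip(scores, labels) if lab == group]
    return sum(vals) / len(vals)

X, labels = make_toy_data()
layers = train_two_layer(X, labels)
final = predict_two_layer(layers, X)
print("mean DNA score on DNA residues:", round(mean_score(final, labels, 0, "DNA"), 3))
print("mean DNA score on RNA residues:", round(mean_score(final, labels, 0, "RNA"), 3))
```

On the toy data, the gap between the two printed averages gives a rough sense of how well the DNA model avoids RNA-binding residues. Rerunning with `penalty=1.0` removes the extra weight and offers a simple point of comparison, though results on synthetic data say nothing about the performance reported in the thesis.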

  • Subjects / Keywords
  • Graduation date
    2016-06:Fall 2016
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
    • Department of Electrical and Computer Engineering
  • Specialization
    • Software Engineering and Intelligent Systems
  • Supervisor / co-supervisor and their department(s)
    • Kurgan, Lukasz (Electrical and Computer Engineering)
    • Reformat, Marek (Electrical and Computer Engineering)
  • Examining committee members and their departments
    • Cheng, Jianlin (University of Missouri)
    • Dick, Scott (Electrical and Computer Engineering)
    • Musilek, Petr (Electrical and Computer Engineering)