Fast and accurate computational prediction of functions of intrinsic disorder in proteins

  • Author / Creator
    Meng, Fanchi
  • Intrinsically disordered regions (IDRs) in proteins lack a stable three-dimensional structure under physiological conditions. IDRs are prevalent in nature, functionally important, and difficult to characterize experimentally because they lack a fixed structure. As a result, many computational methods have been developed over the past forty years to detect IDRs from protein sequences. However, functional annotation of IDRs lags behind the rapidly accumulating number of newly discovered proteins, since it still relies largely on time-consuming and costly experiments. Sequence alignment (SA) and existing predictors of functions of IDRs offer a computational way to characterize these functions. However, SA is applicable only when the protein under study shares sufficiently high sequence similarity with annotated homologous sequences, and existing predictors cover only a small portion of all functions of IDRs. We use SA and existing predictors to characterize functions of IDRs in human dengue virus, and we use this project to investigate how well these approaches determine functions of IDRs. Results show that SA is able to find certain functions that are related to IDRs, but it underpredicts the number of IDRs that carry out given functions. Moreover, existing predictors of functions of IDRs cover only protein-binding functions and do not cover other types of functions. To this end, we address the prediction of the most prevalent function that does not involve binding and cannot be predicted by current predictors, i.e., the disordered flexible linkers (DFLs). DFLs are IDRs that serve as flexible linkers/spacers in multi-domain proteins or between structured constituents in domains. We conceptualized, developed and empirically assessed a first-of-its-kind sequence-based predictor of DFLs, DFLpred.
DFLpred uses a set of empirically selected features that quantify propensities to form certain secondary structures, disordered regions and structured regions, which are processed by a fast linear model. DFLpred achieves an area under the ROC curve (AUC) of 0.715, is significantly better than the existing alternatives, and is fast enough to be used at the whole-proteome scale. We also address the prediction of IDRs that carry out multiple functions, i.e., disordered moonlighting regions (DMRs). We conceptualized, designed and empirically evaluated a first-of-its-kind sequence-based predictor of DMRs, DMRpred. We developed novel amino acid indices that quantify propensities for functions relevant to DMRs, and we combined them with evolutionary conservation, putative solvent accessibility and putative intrinsic disorder derived from the input sequence to build a rich profile suitable for accurate prediction of DMRs. We processed this profile to derive innovative features that are input into a Random Forest model to generate the predictions. DMRpred achieves AUC = 0.86 and accuracy = 82%. We demonstrate that these results are significantly better than the results from alternative methods. DMRpred is fast: it completes a prediction for a protein of typical length (about 500 residues) in less than one minute. We provide convenient webservers to make DFLpred and DMRpred available to the research community. To sum up, motivated by the drawbacks of the current computational approaches for the functional characterization of IDRs, we contribute two novel methods that provide accurate predictions of important functional types of IDRs.
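The general pattern described in the abstract (per-residue propensities, aggregated over a sequence window, scored by a linear model, and evaluated by AUC) can be illustrated with a minimal sketch. This is not the DFLpred or DMRpred implementation: the propensity values in `TOY_PROPENSITY`, the single-feature logistic model, its `weight` and `bias`, the window size, and the example sequence and labels are all hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical per-residue propensity values, invented purely for illustration;
# they are NOT the amino acid indices used by DFLpred or DMRpred.
TOY_PROPENSITY = {"G": 0.9, "S": 0.8, "P": 0.7, "E": 0.5, "K": 0.4,
                  "A": 0.1, "L": -0.6, "V": -0.5, "I": -0.7, "F": -0.8}


def window_average(values, half_width):
    """Average each position over a sliding window, truncated at the sequence ends."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def linear_scores(sequence, weight=2.0, bias=0.0, half_width=2):
    """Score each residue with a one-feature logistic model (assumed weights)."""
    raw = [TOY_PROPENSITY.get(aa, 0.0) for aa in sequence]  # unknown residues -> 0.0
    smoothed = window_average(raw, half_width)
    return [1.0 / (1.0 + math.exp(-(weight * x + bias))) for x in smoothed]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up sequence and residue-level labels (1 = linker-like, 0 = other).
    sequence = "LVIFLGSGSPGSLVIFA"
    labels = [0] * 5 + [1] * 7 + [0] * 5
    scores = linear_scores(sequence)
    print(f"AUC on the toy example: {roc_auc(labels, scores):.3f}")
```

The pairwise AUC shown here is quadratic in the number of residues, which is fine for a toy example; a rank-based computation would be the usual choice for proteome-scale evaluation.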

  • Subjects / Keywords
  • Graduation date
    Spring 2018
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.