Accuracy of protein-level disorder predictions

Brief Bioinform. 2020 Sep 25;21(5):1509-1522. doi: 10.1093/bib/bbz100.

Abstract

Experimental annotations of intrinsic disorder are available for 0.1% of the 147 000 000 currently sequenced proteins. Over 60 sequence-based disorder predictors have been developed to help bridge this gap. Current benchmarks of these methods assess predictive performance on datasets of proteins; however, predictions are often interpreted for individual proteins. We demonstrate that protein-level predictive performance varies substantially from the dataset-level benchmarks. Thus, we perform a first-of-its-kind protein-level assessment of 13 popular disorder predictors using 6200 disorder-annotated proteins. We show that the protein-level distributions are substantially skewed toward high predictive quality while having long tails of poor predictions. Consequently, between 57% and 75% of proteins secure higher predictive performance than the currently used dataset-level assessment suggests, but as many as 30% of proteins, those located in the long tails, suffer low predictive performance. These proteins typically have relatively high amounts of disorder, in contrast to the mostly structured proteins that are predicted accurately by all 13 methods. Interestingly, each predictor provides the most accurate results for some number of proteins, while the method that performs best at the dataset level is in fact the best for only about 30% of proteins. Moreover, for the majority of proteins, at least four disorder predictors produce predictions that are more accurate than the dataset-level performance of the most accurate tool. While these results suggest that disorder predictors outperform their current benchmark performance for the majority of proteins and that they complement each other, novel tools that accurately identify the hard-to-predict proteins and that make accurate predictions for these proteins are needed.
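The distinction between dataset-level and protein-level assessment can be illustrated with a minimal sketch. It is not the evaluation protocol of the study: the abstract does not name the performance measure, so plain per-residue accuracy is assumed here purely for illustration, and the function names and the toy annotations are hypothetical.

```python
def accuracy(pairs):
    """Fraction of residues whose predicted label matches the annotation."""
    pairs = list(pairs)
    return sum(1 for truth, pred in pairs if truth == pred) / len(pairs)


def dataset_level(proteins):
    """Pool the residues of all proteins, then compute a single score."""
    pooled = [(t, p) for truth, pred in proteins for t, p in zip(truth, pred)]
    return accuracy(pooled)


def protein_level(proteins):
    """Compute one score per protein."""
    return [accuracy(zip(truth, pred)) for truth, pred in proteins]


# Made-up toy data (assumption): 1 = disordered residue, 0 = structured residue.
# Each entry is (annotated labels, predicted labels) for one protein.
proteins = [
    ([0] * 10, [0] * 10),                # fully structured, predicted perfectly
    ([0] * 8 + [1] * 2, [0] * 10),       # mostly structured, disorder missed
    ([1] * 5, [0, 0, 0, 0, 1]),          # highly disordered, predicted poorly
]

pooled_score = dataset_level(proteins)
per_protein = protein_level(proteins)

# Share of proteins that score above the single dataset-level number.
above = sum(score > pooled_score for score in per_protein) / len(per_protein)

print(f"dataset-level accuracy: {pooled_score:.2f}")
print("protein-level accuracies:", per_protein)
print(f"fraction of proteins above the dataset-level score: {above:.2f}")
```

In this toy case two of the three proteins score above the pooled value, while the single highly disordered protein falls far below it, which mirrors the skewed distribution with a long tail described above. Real assessments would use measures that are robust to class imbalance; accuracy is used only to keep the sketch short.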

Keywords: accuracy; disorder content; intrinsic disorder; intrinsically disordered proteins; intrinsically disordered regions; prediction; predictive performance; protein sequence.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.
  • Review

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Crystallography, X-Ray
  • Databases, Protein
  • Datasets as Topic
  • Intrinsically Disordered Proteins / chemistry*
  • Nuclear Magnetic Resonance, Biomolecular
  • Protein Conformation
  • Sequence Analysis, Protein / methods

Substances

  • Intrinsically Disordered Proteins