Improving protein fold recognition by random forest

BMC Bioinformatics. 2014;15 Suppl 11(Suppl 11):S14. doi: 10.1186/1471-2105-15-S11-S14. Epub 2014 Oct 21.

Abstract

Background: Recognizing the correct structural fold among known template protein structures for a target protein (i.e., fold recognition) is essential for template-based protein structure modeling. Because fold recognition can be framed as a binary classification problem, namely predicting whether the unknown fold of a target protein is similar to a known template protein structure in a library, machine learning methods have been applied effectively to it. We developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds.
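To make the binary-classification framing concrete, the following is a minimal sketch and not RF-Fold's implementation. It assumes each target-template pair is described by two hypothetical similarity features (an alignment score and a secondary-structure agreement score, neither taken from the paper) and labeled 1 if the pair shares a fold, else 0. As a heavily simplified stand-in for a random forest, it bags depth-one trees, each trained on a bootstrap sample with one randomly chosen feature, and averages their outputs.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(X, y, feature):
    """Fit a depth-one tree on one feature.

    Returns (feature, threshold, p_left, p_right), where p_left and p_right
    are the fractions of positive pairs on each side of the threshold.
    """
    base = sum(y) / len(y)
    # Fallback when no split helps: a constant predictor (threshold = +inf).
    best = (gini(y), feature, float("inf"), base, base)
    for t in sorted({row[feature] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        if not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[0]:
            best = (score, feature, t,
                    sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)          # random feature choice
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], feature))
    return forest


def predict_proba(forest, row):
    """Average the per-tree estimates that the pair shares a fold."""
    votes = [p_left if row[f] <= t else p_right
             for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)


# Hypothetical pair features: [alignment score, secondary-structure agreement].
# Two positive pairs among eight mimics (mildly) the class imbalance.
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.3], [0.1, 0.2],
     [0.3, 0.1], [0.2, 0.2], [0.1, 0.3], [0.3, 0.3]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

forest = fit_forest(X, y)
p_similar = predict_proba(forest, [0.95, 0.95])
p_dissimilar = predict_proba(forest, [0.05, 0.05])
```

A real random forest grows deeper trees and samples a feature subset at every split; the averaged output can then be used to rank templates for each target.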

Results: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets and make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold by cross-validation on the standard Lindahl benchmark dataset, which comprises 976 × 975 target-template protein pairs. Compared with 17 different fold recognition methods, RF-Fold performs generally comparably to the best of them across three levels of difficulty: the easiest family level, the medium-hard superfamily level, and the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at the family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at the family, superfamily, and fold levels, respectively.
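The top-one and top-five recognition rates can be illustrated with a short sketch. It assumes that a target counts as correctly recognized when at least one of its k highest-scoring templates shares its class label at the level being evaluated, and that targets with no correct template in the library are left out of the denominator; both are assumptions about the protocol, not details stated in this abstract. The function name, protein identifiers, labels, and scores are hypothetical.

```python
def top_k_recognition_rate(scores, labels, k):
    """Fraction of targets with a correct template among their top k.

    scores: {target: {template: score}}, higher score = more similar.
    labels: {protein: class label at one level (family, superfamily, or fold)}.
    """
    evaluated = 0
    hits = 0
    for target, template_scores in scores.items():
        # Skip targets that have no correct template at this level.
        if not any(labels[t] == labels[target] for t in template_scores):
            continue
        evaluated += 1
        ranked = sorted(template_scores, key=template_scores.get, reverse=True)
        if any(labels[t] == labels[target] for t in ranked[:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Toy data: five proteins, three classes; "E" has no partner in its class.
labels = {"A": "f1", "B": "f1", "C": "f2", "D": "f2", "E": "f3"}
scores = {
    "A": {"B": 0.9, "C": 0.2, "D": 0.1, "E": 0.3},
    "C": {"A": 0.8, "B": 0.1, "D": 0.6, "E": 0.2},
    "E": {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1},
}

top1 = top_k_recognition_rate(scores, labels, k=1)
top2 = top_k_recognition_rate(scores, labels, k=2)
```

In this toy data, target "C" ranks a wrong template first and the correct one second, so the rate rises when k grows from 1 to 2, the same effect seen between the top-one and top-five figures above.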

Conclusions: The good performance achieved by RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Artificial Intelligence*
  • Decision Trees
  • Protein Folding*
  • Protein Structure, Tertiary*
  • Software