eThread: a highly optimized machine learning-based approach to meta-threading and the modeling of protein tertiary structures

PLoS One. 2012;7(11):e50200. doi: 10.1371/journal.pone.0050200. Epub 2012 Nov 21.

Abstract

Template-based modeling that employs various meta-threading techniques is currently the most accurate, and consequently the most commonly used, approach for protein structure prediction. Despite the evident progress in this field, accurate structure models cannot be constructed for a significant fraction of gene products, thus the development of new algorithms is required. Here, we describe the development, optimization and large-scale benchmarking of eThread, a highly accurate meta-threading procedure for the identification of structural templates and the construction of corresponding target-to-template alignments. eThread integrates ten state-of-the-art threading/fold recognition algorithms in a local environment and extensively uses various machine learning techniques to carry out fully automated template-based protein structure modeling. Tertiary structure prediction employs two protocols based on widely used modeling algorithms: Modeller and TASSER-Lite. As a part of eThread, we also developed eContact, which is a Bayesian classifier for the prediction of inter-residue contacts and eRank, which effectively ranks generated multiple protein models and provides reliable confidence estimates as structure quality assessment. Excluding closely related templates from the modeling process, eThread generates models, which are correct at the fold level, for >80% of the targets; 40-50% of the constructed models are of a very high quality, which would be considered accurate at the family level. Furthermore, in large-scale benchmarking, we compare the performance of eThread to several alternative methods commonly used in protein structure prediction. Finally, we estimate the upper bound for this type of approach and discuss the directions towards further improvements.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Artificial Intelligence*
  • Bayes Theorem
  • Computer Simulation
  • Databases, Protein
  • Models, Molecular
  • Protein Folding
  • Protein Structure, Tertiary
  • Proteins / chemistry*
  • Sequence Alignment
  • Software*
  • Structural Homology, Protein

Substances

  • Proteins

Grants and funding

This study was supported by the Louisiana Board of Regents through the Board of Regents Support Fund [contract LEQSF(2012-15)-RD-A-05]. Portions of this research were conducted with high performance computational resources provided by Louisiana State University (HPC@LSU, http://www.hpc.lsu.edu) and the Louisiana Optical Network Institute (LONI, http://www.loni.org). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.