Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable

Myron Peto; Andrzej Kloczkowski; Vasant Honavar; Robert L Jernigan

doi:10.1186/1471-2105-9-487

BMC Bioinformatics. 2008 Nov 18;9:487. doi: 10.1186/1471-2105-9-487.

Authors

Myron Peto¹, Andrzej Kloczkowski, Vasant Honavar, Robert L Jernigan

Affiliation

¹ Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011-3020, USA. myron.peto@ars.usda.gov

Abstract

Background: Using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If, for a given sequence, the lowest energy is obtained for a particular lattice conformation, we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences folding to them. We classify sequences as folding to either highly- or poorly-designable conformations. We randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.
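The enumeration and threading steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a tiny triangular patch of 6 sites, an energy of -1 per non-bonded H-H contact (the abstract says only "a specified energy function"), conformations identified by their contact maps so that symmetry-related walks count once, and a sequence counted as folding only when exactly one contact map attains its lowest energy. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Six unit steps of the 2D triangular lattice in axial coordinates.
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def triangle_sites(side):
    """Sites of a triangular patch with `side` sites along each edge."""
    return {(x, y) for x in range(side) for y in range(side - x)}


def compact_walks(sites):
    """Yield every self-avoiding walk that visits each site exactly once."""
    def extend(path, visited):
        if len(path) == len(sites):
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            nxt = (x + dx, y + dy)
            if nxt in sites and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in sorted(sites):
        yield from extend([start], {start})


def contact_map(walk):
    """Chain-index pairs (i, j), j > i + 1, that sit on adjacent sites."""
    index = {site: i for i, site in enumerate(walk)}
    contacts = set()
    for (x, y), i in index.items():
        for dx, dy in STEPS:
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1:
                contacts.add((i, j))
    return frozenset(contacts)


def energy(seq, contacts):
    """Assumed energy: -1 for every non-bonded H-H contact."""
    return -sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")


def designability(side=3):
    """Count, per conformation, the sequences with that unique ground state."""
    sites = triangle_sites(side)
    conformations = list({contact_map(w) for w in compact_walks(sites)})
    counts = Counter()
    for seq in product("HP", repeat=len(sites)):
        energies = [energy(seq, c) for c in conformations]
        best = min(energies)
        if energies.count(best) == 1:  # skip sequences with degenerate ground states
            counts[conformations[energies.index(best)]] += 1
    return conformations, counts


if __name__ == "__main__":
    conformations, counts = designability(3)
    print(len(conformations), "distinct contact maps")
    print(sorted(counts.values(), reverse=True))
```

Under these assumptions, the conformations with the largest counts would play the role of the highly-designable class and those with small or zero counts the poorly-designable class; the shapes studied in the paper are larger than this toy patch.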

Conclusion: By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
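To make the evaluation protocol concrete, the following minimal Python sketch shows ten-fold cross-validation around a Bernoulli Naïve Bayes classifier on H/P sequences encoded as 0/1 vectors. It is an illustration under stated assumptions, not the authors' pipeline: the encoding, the Laplace smoothing, the fold construction, and all function names are hypothetical, and `toy_dataset` generates placeholder sequences with an artificial labeling rule rather than real designability labels.

```python
import math
import random


def encode(seq):
    """Map an H/P string to a 0/1 feature vector (H -> 1, P -> 0)."""
    return [1 if ch == "H" else 0 for ch in seq]


def train_nb(X, y):
    """Fit a Bernoulli Naive Bayes model with Laplace smoothing."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = math.log(len(rows) / len(X))
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (prior, probs)
    return model


def predict_nb(model, x):
    """Return the label with the highest log-posterior score."""
    def score(label):
        prior, probs = model[label]
        return prior + sum(math.log(p if v else 1 - p) for v, p in zip(x, probs))

    return max(model, key=score)


def cross_validate(X, y, k=10, seed=0):
    """Return accuracy averaged over all samples in k-fold cross-validation."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = train_nb([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_nb(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def toy_dataset(n=200, length=12, seed=1):
    """Placeholder data: random H/P strings with an artificial label rule."""
    rng = random.Random(seed)
    seqs = ["".join(rng.choice("HP") for _ in range(length)) for _ in range(n)]
    labels = [1 if s[0] == "H" else 0 for s in seqs]  # NOT designability labels
    return seqs, labels


if __name__ == "__main__":
    seqs, labels = toy_dataset()
    print(cross_validate([encode(s) for s in seqs], labels, k=10))
```

Each sample is held out exactly once, so the returned value is the fraction of held-out predictions that were correct. With real data, the labels would come from the designability of the conformation each sequence folds to.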

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Artificial Intelligence*
Bayes Theorem
Computational Biology / methods*
Computer Simulation
Databases, Protein
Models, Molecular
Models, Statistical
Protein Conformation
Protein Folding
Proteins / chemistry*
Proteins / classification
ROC Curve
Reproducibility of Results
Sequence Analysis, Protein / methods*

Substances

Proteins
