Predicting Protein Secondary Structure Using Consensus Data Mining (CDM) Based on Empirical Statistics and Evolutionary Information

Gaurav Kandoi; Sumudu P Leelananda; Robert L Jernigan; Taner Z Sen

doi:10.1007/978-1-4939-6406-2_4

Methods Mol Biol. 2017:1484:35-44. doi: 10.1007/978-1-4939-6406-2_4.

Authors

Gaurav Kandoi¹,², Sumudu P Leelananda³, Robert L Jernigan¹,⁴, Taner Z Sen⁵,⁶

Affiliations

¹ Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA.
² Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA.
³ Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, USA.
⁴ Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA.
⁵ Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA. taner@iastate.edu.
⁶ Department of Genetics, Development and Cell Biology, Iowa State University, 1025 Crop Genome Informatics Lab, Ames, IA, 50011, USA. taner@iastate.edu.

PMID: 27787818
DOI: 10.1007/978-1-4939-6406-2_4

Abstract

Predicting the secondary structure of a protein from its sequence remains a challenging problem. Prediction accuracies remain around 80% across very diverse methods. The use of evolutionary information and machine learning algorithms, in particular, has had the most impact. In this chapter, we first define secondary structures and then review the Consensus Data Mining (CDM) technique, which is based on the robust GOR algorithm and the Fragment Database Mining (FDM) approach. GOR V is an empirical method that uses a sliding window to model the secondary structural elements of a protein, making use of generalized evolutionary information. FDM mines experimentally determined structure fragments and predicts the secondary structure of a protein by combining those fragments on the basis of their sequence similarity. The CDM method combines predictions from GOR V and FDM in a hierarchical manner to produce consensus predictions for secondary structure. In other words, if sequence fragments are not available, it uses GOR V to make the secondary structure prediction. The online server of CDM is available at http://gor.bb.iastate.edu/cdm/ .
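
The hierarchical fallback described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `consensus_prediction`, the per-residue granularity, the use of `None` to mark positions without fragment coverage, and the toy strings are all assumptions made for illustration. The abstract states only that GOR V is used when sequence fragments are not available.

```python
from typing import Optional, Sequence


def consensus_prediction(fdm: Sequence[Optional[str]], gor_v: str) -> str:
    """Combine two per-residue predictions hierarchically.

    fdm   : hypothetical FDM output, one state per residue (e.g. 'H', 'E', 'C'),
            or None where no matching fragment was found (an assumed convention).
    gor_v : hypothetical GOR V output, one state character per residue.
    """
    if len(fdm) != len(gor_v):
        raise ValueError("predictions must cover the same number of residues")
    # Prefer the fragment-based state; fall back to GOR V where it is missing.
    return "".join(f if f is not None else g for f, g in zip(fdm, gor_v))


# Toy inputs, invented for illustration only.
gor_v_states = "CCHHHHEEEC"
fdm_states = [None, None, "H", "H", "H", "E", None, None, "E", "C"]
print(consensus_prediction(fdm_states, gor_v_states))
```

In this sketch, the fragment-based state takes priority wherever one exists, and the GOR V state fills the remaining positions.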

Keywords: Consensus data mining; Fragment database mining; GOR; Machine learning; Multiple sequence alignments; Protein structure prediction; Secondary structure.

MeSH terms

Algorithms
Amino Acid Sequence / genetics
Data Mining
Protein Structure, Secondary / genetics*
Proteins / chemistry
Proteins / genetics*
Sequence Alignment / methods
Software*

Substances

Proteins