Incorporating secondary structural features into sequence information for predicting protein structural class

Bo Liao; Ting Peng; Haowen Chen; Yaping Lin

doi:10.2174/09298665113209990002

Protein Pept Lett. 2013 Oct;20(10):1079-87. doi: 10.2174/09298665113209990002.

Authors

Bo Liao¹, Ting Peng, Haowen Chen, Yaping Lin

Affiliation

¹ College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China. dragonbw@163.com

PMID: 23688152
DOI: 10.2174/09298665113209990002

Abstract

Knowledge of protein structural classes supports numerous important predictive tasks that address the structural and functional features of proteins, yet the accuracy of structural class prediction remains limited. In this study, 45 features were rationally designed to model the differences between protein structural classes. Thirty of them reflect combined protein sequence information: using a correlation function, the protein sequence is converted to a digital signal sequence, from which 20 discrete Fourier spectrum numbers are generated, and the frequencies of 10 kinds of amino acid segments (motifs) with different characteristics are calculated from the protein sequence. The remaining 15 features capture secondary structural information: 10 features were proposed to model the strong adjacent correlations in the secondary structural elements and to capture the long-range spatial interactions between secondary structures, and 5 features were designed to differentiate α/β from α+β classes, which is a major problem for existing algorithms. The method was developed on a large set of low-identity sequences whose secondary structure is predicted from sequence (using PSI-PRED). With this method, the overall prediction accuracy improved on all four benchmark datasets; on FC699, 25PDB and D1189 in particular, it is 1.26%, 1% and 0.85% higher, respectively, than the best previous method.
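
The sequence-to-spectrum step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which numeric encoding or correlation function the authors used, so everything specific here is an assumption made for illustration: the Kyte-Doolittle hydropathy scale as the residue encoding, a mean-centred autocorrelation over lags 1 to 40, and the hypothetical function names `autocorrelation` and `fourier_spectrum_features`. It is not the authors' implementation.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy values are used as the numeric
# encoding of residues. The abstract does not specify the encoding.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation(signal, max_lag):
    """Mean-centred autocorrelation of a 1-D signal for lags 1..max_lag."""
    x = signal - signal.mean()
    n = len(x)
    return np.array(
        [np.dot(x[: n - k], x[k:]) / (n - k) for k in range(1, max_lag + 1)]
    )


def fourier_spectrum_features(sequence, n_features=20, max_lag=40):
    """Return n_features discrete Fourier magnitudes for a protein sequence.

    Hypothetical pipeline: residues -> hydropathy signal -> autocorrelation
    -> real-input DFT -> magnitudes of the first n_features coefficients.
    """
    # Residues outside the 20 standard amino acids are skipped.
    signal = np.array(
        [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE],
        dtype=float,
    )
    if len(signal) <= max_lag:
        raise ValueError("sequence is too short for the requested max_lag")
    if n_features > max_lag // 2 + 1:
        raise ValueError("n_features exceeds the number of DFT coefficients")

    correlation_signal = autocorrelation(signal, max_lag)
    spectrum = np.abs(np.fft.rfft(correlation_signal))
    return spectrum[:n_features]


if __name__ == "__main__":
    # Arbitrary example sequence, used only to exercise the function.
    example = (
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    )
    features = fourier_spectrum_features(example)
    print(features.shape)
```

Computing the spectrum over a fixed number of lags gives every protein a feature vector of the same length regardless of sequence length, which is what a downstream classifier needs; the choice of 40 lags here is arbitrary.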

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Amino Acid Sequence
Databases, Protein
Protein Structure, Secondary
Proteins / chemistry*
Proteins / classification

Substances

Proteins