A two-stage approach towards protein secondary structure classification

Kushal Kanti Ghosh; Soulib Ghosh; Sagnik Sen; Ram Sarkar; Ujjwal Maulik

doi:10.1007/s11517-020-02194-w

Med Biol Eng Comput. 2020 Aug;58(8):1723-1737. doi: 10.1007/s11517-020-02194-w. Epub 2020 May 29.

Authors

Kushal Kanti Ghosh¹, Soulib Ghosh², Sagnik Sen², Ram Sarkar², Ujjwal Maulik²

Affiliations

¹ Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. kushalkanti1999@gmail.com.
² Department of Computer Science and Engineering, Jadavpur University, Kolkata, India.

PMID: 32472446
DOI: 10.1007/s11517-020-02194-w

Abstract

Protein secondary structure (PSS) describes the local folded structures that form within a polypeptide through interactions among backbone atoms. Globular proteins are generally divided into four classes: all-α, all-β, α + β, and α/β. Because nearly 90% of proteins fall into these four classes, they are the ones most often considered in computational protein classification. PSS classification is important for several biological applications, including protein fold recognition, tertiary structure prediction, prediction of DNA-binding sites, and reduction of the conformation search space. In this paper, we propose a machine learning-based model for classifying the secondary structure of proteins into the four classes all-α, all-β, α + β, and α/β. The model uses both sequence-based and structure-based features. First, mutual information (MI), a filter-based feature selection method, is used to remove redundant features. The selected features are then used to train three different classifiers: random forest, K-nearest neighbor (KNN), and multi-layer perceptron (MLP). Next, several standard classifier combination approaches are applied to integrate the decisions of these classifiers, and the weighted product rule is found to perform best among them. The overall accuracies obtained with the proposed model on the four standard datasets, namely 640, 1189, 25pdb, and fc699, are 86.89%, 92.93%, 91.38%, and 94.87%, respectively. The proposed model outperforms the state-of-the-art methods considered here for comparison. We attribute the high classification accuracy on all four datasets to a comprehensive feature set, from which redundant features are eliminated by feature selection, and which is then passed through an ensemble consisting of three different classifiers. Assigning different weights to the outputs of the different classifiers thus proved useful in designing a model that predicts the secondary structure class of proteins from their sequence-based and structure-based features.
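
The abstract names the weighted product rule as the best-performing combination scheme but gives neither its formula nor the weights used. The sketch below shows one common formulation of that rule, in which each class score is the product of the per-classifier probabilities raised to that classifier's weight, followed by normalization. The function names, the probability vectors, and the weights are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

# Class labels as named in the abstract (ASCII spellings are an assumption).
CLASSES = ["all-alpha", "all-beta", "alpha+beta", "alpha/beta"]


def weighted_product_rule(prob_vectors, weights, eps=1e-12):
    """Fuse per-classifier class probabilities with a weighted product.

    score(c) is proportional to prod_k p_k(c) ** w_k. The product is
    accumulated in log space for numerical stability, and eps guards
    against log(0).
    """
    if len(prob_vectors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    n_classes = len(prob_vectors[0])
    log_scores = [0.0] * n_classes
    for probs, w in zip(prob_vectors, weights):
        if len(probs) != n_classes:
            raise ValueError("all probability vectors must have equal length")
        for c, p in enumerate(probs):
            log_scores[c] += w * math.log(max(p, eps))
    # Subtract the maximum before exponentiating, then normalize to sum to 1.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predict_class(prob_vectors, weights):
    """Return the label of the class with the highest fused score."""
    fused = weighted_product_rule(prob_vectors, weights)
    return CLASSES[fused.index(max(fused))]


# Hypothetical outputs of the three classifiers for a single protein.
rf_probs = [0.50, 0.30, 0.10, 0.10]   # random forest
knn_probs = [0.20, 0.60, 0.10, 0.10]  # K-nearest neighbor
mlp_probs = [0.30, 0.50, 0.10, 0.10]  # multi-layer perceptron

# Hypothetical weights; in practice they might reflect validation accuracy.
example_weights = [0.5, 0.2, 0.3]

print(predict_class([rf_probs, knn_probs, mlp_probs], example_weights))
```

With these illustrative inputs, the random forest favors all-α, but the combined evidence from the other two classifiers outweighs it even though the random forest carries the largest weight. Setting one weight to 1 and the others to 0 reduces the rule to that single classifier's prediction.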

Keywords: Classifier combination; Feature selection; Protein; Protein sequence; Secondary structure.

MeSH terms

Databases, Protein
Machine Learning
Neural Networks, Computer
Peptides / chemistry
Protein Structure, Secondary
Proteins / chemistry*

Substances

Peptides
Proteins