Development and validation of multiple machine learning algorithms for the classification of G-protein-coupled receptors using molecular evolution model-based feature extraction strategy

Amino Acids. 2021 Nov;53(11):1705-1714. doi: 10.1007/s00726-021-03080-x. Epub 2021 Sep 25.

Abstract

Machine learning is one of the most promising approaches to function prediction for the growing, large-scale set of known G-protein-coupled receptors (GPCRs). Prior research shows that the key to overall GPCR classification accuracy is extracting informative features and filtering out redundant ones. To build a more efficient classification model, we take the feature synonym problem into account and propose a new method based on functional word clustering and integration. The evolutionary correlation between features is evaluated using the transition scores in established molecular substitution matrices, and candidate features are clustered into synonym groups accordingly. Each group of clustered features is then integrated and represented by a unique key functional word. The retained key functional words form a feature knowledge base. Before the training and testing stage, the original GPCR sequences are converted into feature vectors by a feature re-extraction strategy based on the features in the knowledge base. We build multiple machine learning models based on the naïve Bayes (NB), random forest (RF), support vector machine (SVM), and multi-layer perceptron (MLP) algorithms. The established models are applied to classify two public data sets containing 8,354 and 12,731 GPCRs, respectively. These models perform strongly on almost all evaluation criteria in comparison with state-of-the-art methods. This work demonstrates the potential of the novel feature extraction strategy and provides an effective theoretical design for the hierarchical classification of GPCRs.
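The clustering and re-extraction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption: functional words are taken to be fixed-length k-mers, the toy `sub_score` function (identical residue 4, same hypothetical residue group 2, otherwise -2) stands in for a real substitution matrix, and the greedy frequency-ordered clustering, the `threshold` value, and all function names are chosen only for illustration.

```python
from collections import Counter

# Hypothetical residue groups standing in for a real substitution matrix.
# These scores are illustrative only and are not taken from any published matrix.
_GROUPS = ["ILVM", "KRH", "DE", "FYW", "ST", "NQ", "AG", "C", "P"]
_GROUP_OF = {aa: i for i, group in enumerate(_GROUPS) for aa in group}


def sub_score(a, b):
    """Toy substitution score: 4 if identical, 2 if same group, else -2."""
    if a == b:
        return 4
    group_a = _GROUP_OF.get(a)
    if group_a is not None and group_a == _GROUP_OF.get(b):
        return 2
    return -2


def word_score(w1, w2):
    """Similarity of two equal-length words: sum of per-position scores."""
    return sum(sub_score(a, b) for a, b in zip(w1, w2))


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_knowledge_base(sequences, k=3, threshold=8):
    """Greedily cluster k-mers into synonym groups.

    Words are visited from most to least frequent. A word joins the first
    existing key word it scores at least `threshold` against; otherwise it
    becomes the key word of a new group.
    """
    counts = Counter(w for s in sequences for w in kmers(s, k))
    keys = []          # one key functional word per synonym group
    word_to_key = {}   # every seen word -> its group's key word
    for word, _ in counts.most_common():
        for key in keys:
            if word_score(word, key) >= threshold:
                word_to_key[word] = key
                break
        else:
            keys.append(word)
            word_to_key[word] = word
    return keys, word_to_key


def to_vector(seq, keys, word_to_key, k=3):
    """Re-extract features: count k-mers per key word (unseen words are skipped)."""
    index = {key: i for i, key in enumerate(keys)}
    vec = [0] * len(keys)
    for w in kmers(seq, k):
        key = word_to_key.get(w)
        if key is not None:
            vec[index[key]] += 1
    return vec


keys, word_to_key = build_knowledge_base(["ILKILK", "IVK"], k=3, threshold=8)
vector = to_vector("IVKILK", keys, word_to_key, k=3)
```

In this toy run, `IVK` is merged into the group keyed by `ILK`, so both words contribute to the same vector component. The resulting vectors could then be passed to any standard classifier; a real pipeline would replace `sub_score` with scores looked up from an established substitution matrix.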

Keywords: Artificial neural network; Classification; G-protein-coupled receptors; Machine learning.

Publication types

  • Evaluation Study

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Evolution, Molecular*
  • Machine Learning*
  • Multigene Family
  • Receptors, G-Protein-Coupled / chemistry
  • Receptors, G-Protein-Coupled / genetics*
  • Receptors, G-Protein-Coupled / metabolism
  • Sequence Alignment

Substances

  • Receptors, G-Protein-Coupled