[Method to Generate Complex Predictive Features for Machine Learning-Based Prediction of the Local Structure and Functions of Proteins]

Mol Biol (Mosk). 2023 Jan-Feb;57(1):127-138. doi: 10.31857/S0026898423010093.
[Article in Russian]

Abstract

The accuracy of predicting a protein's structure and function from its sequence has recently improved rapidly. This improvement is primarily due to machine learning methods, many of which depend on the predictive features supplied to them. It is therefore crucial to extract the information encoded in the amino acid sequence of a protein. Here we propose a method to generate a set of complex yet interpretable predictors that help reveal the factors influencing protein conformation. The method makes it possible to generate predictive features and test their significance both for a general description of protein structures and functions and for highly specific predictive tasks. Having generated an exhaustive set of predictors, we narrow it down to a smaller curated set of informative features using feature selection methods, which improves the performance of subsequent predictive modelling. We illustrate the effectiveness of our methodology by applying it to local protein structure prediction, where the rate of correct prediction for DSSP Q3 (three-class classification) is 81.3%. The method is implemented in C++ for command-line use and can be run on any operating system. The source code is available on GitHub at https://github.com/Milchevskiy/protein-encoding-projects.
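As an illustration of the Q3 figure quoted in the abstract, the following minimal sketch shows how a three-class accuracy of this kind is typically computed. It is written in Python for brevity and is not taken from the authors' C++ code. The eight-to-three state reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is a common convention and an assumption, since the abstract does not say which reduction was applied. The names `DSSP_TO_Q3`, `reduce_to_q3`, and `q3_accuracy` are hypothetical.

```python
# Assumed DSSP 8-state to 3-state reduction (a common convention;
# the abstract does not state which mapping the authors used).
DSSP_TO_Q3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_q3(dssp: str) -> str:
    """Map a DSSP state string to three classes: H (helix), E (strand), C (coil)."""
    # Any state not listed above (T, S, blank, '-') is treated as coil.
    return "".join(DSSP_TO_Q3.get(state, "C") for state in dssp)


def q3_accuracy(predicted_q3: str, observed_dssp: str) -> float:
    """Fraction of residues whose predicted class matches the reduced DSSP class."""
    observed_q3 = reduce_to_q3(observed_dssp)
    if len(predicted_q3) != len(observed_q3):
        raise ValueError("prediction and observation must have the same length")
    if not observed_q3:
        raise ValueError("empty sequence")
    correct = sum(p == o for p, o in zip(predicted_q3, observed_q3))
    return correct / len(observed_q3)


if __name__ == "__main__":
    # Toy example with made-up strings, not data from the paper.
    observed = "HHHGTTEEB-"
    predicted = "HHHCCCEEEC"
    print(reduce_to_q3(observed))
    print(q3_accuracy(predicted, observed))
```

In this toy example nine of the ten residues match after reduction. A reported Q3 such as 81.3% would be the same ratio accumulated over all residues of a test set.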

Keywords: local structure prediction; protein conformation; protein function; protein secondary structure prediction; protein sequence encoding; stepwise discriminant analysis; stepwise regression.

Publication types

  • English Abstract

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Machine Learning*
  • Protein Conformation
  • Proteins* / chemistry
  • Software

Substances

  • Proteins