A protein sequence fitness function for identifying natural and nonnatural proteins

Proteins. 2020 Oct;88(10):1271-1284. doi: 10.1002/prot.25900. Epub 2020 May 28.

Abstract

The infinitesimally small fraction of sequence space explored over millions of years of evolution suggests that natural proteins are constrained by functional prerequisites and should differ from randomly generated sequences. We have developed a protein sequence fitness scoring function that uses sequence and corresponding secondary structural information at the tripeptide level to differentiate natural and nonnatural proteins. The proposed fitness function is extensively validated on a dataset of about 210 000 natural and nonnatural protein sequences and benchmarked against existing methods for differentiating natural and nonnatural proteins. The high sensitivity, specificity, and percentage accuracy (0.81, 0.95, and 91%, respectively) of the fitness function demonstrate its potential application for sampling protein sequences with a higher probability of mimicking natural proteins. Moreover, the four major classes of proteins (α proteins, β proteins, α/β proteins, and α + β proteins) are separately analyzed, and β proteins are found to score slightly lower than the other classes. Further, an analysis of about 250 designed proteins (adopted from previously reported cases) helped to define the boundaries for sampling ideal protein sequences. The protein sequence characterization aided by the proposed fitness function could facilitate the exploration of new perspectives in the design of novel functional proteins.
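To illustrate the general idea of tripeptide-level scoring and of the reported evaluation metrics, the following is a minimal sketch and not the authors' fitness function. It assumes a simple log-odds score of tripeptide frequencies in natural versus background sequences, and it omits the secondary structural information that the abstract says the actual function uses. All function names (`tripeptides`, `train_log_odds`, `fitness`, `classification_metrics`), the pseudocount, and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def tripeptides(seq):
    """Return all overlapping tripeptides (3-mers) of a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def train_log_odds(natural, background, pseudocount=1.0):
    """Log-odds of each tripeptide in natural vs. background sequences.

    Assumption: a plain frequency ratio with additive smoothing; the
    published function may be defined quite differently.
    """
    nat = Counter(t for s in natural for t in tripeptides(s))
    bg = Counter(t for s in background for t in tripeptides(s))
    vocab = set(nat) | set(bg)
    nat_total = sum(nat.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {
        t: math.log(((nat[t] + pseudocount) / nat_total)
                    / ((bg[t] + pseudocount) / bg_total))
        for t in vocab
    }


def fitness(seq, table):
    """Mean log-odds over the tripeptides of seq; unseen tripeptides score 0."""
    tris = tripeptides(seq)
    if not tris:
        return 0.0
    return sum(table.get(t, 0.0) for t in tris) / len(tris)


def classification_metrics(labels, predictions):
    """Return (sensitivity, specificity, accuracy) for binary labels.

    A label or prediction of 1 means "natural", 0 means "nonnatural".
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy


# Toy usage with made-up sequences (not real data).
table = train_log_odds(natural=["ACDACDACD"], background=["GGGGGGGGG"])
natural_like = fitness("ACDACD", table)
background_like = fitness("GGGG", table)
```

In this sketch a sequence would be called natural when its score exceeds a chosen threshold, and sensitivity and specificity are fractions between 0 and 1 while accuracy can be reported as a percentage, which matches how the three values are quoted above.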

Keywords: amino acid propensity; computational protein design; natural proteins; protein foldability; protein sequence space; scoring of protein designs.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Benchmarking
  • Datasets as Topic
  • Humans
  • Models, Statistical*
  • Protein Engineering / methods*
  • Protein Folding
  • Protein Structure, Secondary
  • Proteins / chemistry*
  • ROC Curve
  • Research Design*

Substances

  • Proteins