Seq2seq Fingerprint with Byte-Pair Encoding for Predicting Changes in Protein Stability upon Single Point Mutation

IEEE/ACM Trans Comput Biol Bioinform. 2020 Sep-Oct;17(5):1762-1772. doi: 10.1109/TCBB.2019.2908641. Epub 2019 Apr 1.

Abstract

The engineering of stable proteins is crucial for various industrial purposes. Several machine learning methods have been developed to predict changes in the stability of proteins corresponding to single point mutations. To improve the prediction accuracy, we propose a new unsupervised descriptor for protein sequences, which is based on a sequence-to-sequence (seq2seq) neural network model combined with a sequence-compression method called byte-pair encoding (BPE). Our results demonstrate that BPE can encode a protein sequence into a sequence of shorter length, thereby enabling efficient training of the seq2seq model. Furthermore, we implement a basic predictor using the proposed descriptor, and our experimental results demonstrate that the predictor achieves state-of-the-art accuracy in tests for proteins that are not included in the training data.
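The compression step described above can be illustrated with a minimal sketch of generic byte-pair encoding applied to amino-acid strings. This is not the authors' implementation: the abstract does not give the vocabulary size, training corpus, or tie-breaking rule, so the function names (`learn_bpe`, `merge_pair`, `encode`), the toy sequences, and the merge budget below are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of amino-acid strings."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # (an arbitrary choice for this sketch).
        best_pair, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # no pair repeats, so further merges would not compress
        merges.append(best_pair)
        corpus = [merge_pair(tokens, best_pair) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy sequences, not taken from the paper's data.
toy_sequences = ["MKVLAAGMKVL", "MKVLGG"]
merges = learn_bpe(toy_sequences, num_merges=10)
encoded = encode("MKVLAAGMKVL", merges)
print(len("MKVLAAGMKVL"), "residues ->", len(encoded), "tokens:", encoded)
```

Because frequent residue pairs are merged into single tokens, the encoded sequence has fewer tokens than the original has residues, which is the property the abstract credits with making seq2seq training more efficient. Concatenating the tokens recovers the original sequence, so the encoding is lossless.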

MeSH terms

  • Amino Acid Sequence / genetics
  • Computational Biology / methods*
  • Databases, Genetic
  • Humans
  • Neural Networks, Computer
  • Point Mutation / genetics*
  • Protein Stability*
  • Proteins / chemistry
  • Proteins / genetics
  • Sequence Analysis, Protein / methods*
  • Unsupervised Machine Learning*

Substances

  • Proteins