A systematic exploration of ΔΔG cutoff ranges in machine learning models for protein mutation stability prediction

J Bioinform Comput Biol. 2018 Oct;16(5):1840022. doi: 10.1142/S021972001840022X.

Abstract

Discerning how a mutation affects the stability of a protein is central to the study of a wide range of diseases. Mutagenesis experiments on physical proteins provide precise insights about the effects of amino acid substitutions, but such studies are time- and cost-prohibitive. Computational approaches, including a variety of machine learning models, are available to inform experimentalists where to allocate wet-lab resources. Assessing the accuracy of machine learning models for predicting the effects of mutations depends on in vitro experiments for amino acid substitutions. When similar experiments on physical proteins have been performed by multiple laboratories, the use of data near the juncture of stabilizing and destabilizing mutations is questionable. In this work, we explore a systematic and principled alternative to discarding experimental data close to that juncture. We model the inconclusive range of experimental ΔΔG values via 3- and 5-way classifiers, and systematically explore potential boundaries for the range of inconclusive experimental values. We demonstrate the effectiveness of potential boundaries through confusion matrices and heat map visualizations. We explore two novel metrics for assessing viable cutoff ranges, and find that under these metrics, a lower cutoff near [Formula: see text] and an upper cutoff near [Formula: see text] are optimal across multiple machine learning models.
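
The 3-way labelling and boundary sweep described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the sign convention (negative values treated as destabilizing), and all cutoff values are assumptions chosen for illustration, and the two metrics mentioned in the abstract are not reproduced because the abstract does not define them.

```python
from typing import Dict, List, Sequence, Tuple

# Class order used for the rows (experimental) and columns (predicted).
LABELS = ("destabilizing", "inconclusive", "stabilizing")


def label_ddg(ddg: float, lower: float, upper: float) -> str:
    """Assign one of three classes to a stability change value.

    Assumption: values below `lower` are destabilizing, values above
    `upper` are stabilizing, and everything in between is inconclusive.
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if ddg < lower:
        return "destabilizing"
    if ddg > upper:
        return "stabilizing"
    return "inconclusive"


def confusion_matrix(true: Sequence[str], pred: Sequence[str]) -> List[List[int]]:
    """Count (experimental class, predicted class) pairs in a 3x3 matrix."""
    index = {name: i for i, name in enumerate(LABELS)}
    matrix = [[0] * len(LABELS) for _ in LABELS]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def sweep_cutoffs(
    experimental: Sequence[float],
    predicted: Sequence[float],
    lowers: Sequence[float],
    uppers: Sequence[float],
) -> Dict[Tuple[float, float], List[List[int]]]:
    """Build one confusion matrix per candidate (lower, upper) pair."""
    results = {}
    for lower in lowers:
        for upper in uppers:
            if lower > upper:
                continue  # skip pairs that do not define a valid range
            true = [label_ddg(v, lower, upper) for v in experimental]
            pred = [label_ddg(v, lower, upper) for v in predicted]
            results[(lower, upper)] = confusion_matrix(true, pred)
    return results


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the study.
    experimental = [-1.2, 0.1, 0.8, -0.2]
    predicted = [-0.9, 0.6, 0.7, -0.1]
    grid = sweep_cutoffs(experimental, predicted, [-1.0, -0.5], [0.5, 1.0])
    for (lower, upper), matrix in sorted(grid.items()):
        print(lower, upper, matrix)
```

Each matrix in the sweep could then be scored and the scores arranged on a lower-by-upper grid, which is one plausible way to obtain a heat map over candidate boundaries.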

Keywords: Machine learning; classifier boundaries; protein mutation.

MeSH terms

  • Algorithms
  • Amino Acid Substitution
  • Computational Biology / methods*
  • Machine Learning*
  • Mutation
  • Neural Networks, Computer
  • Protein Stability
  • Proteins / chemistry*
  • Proteins / genetics*
  • Proteins / metabolism
  • Random Allocation
  • Support Vector Machine

Substances

  • Proteins