cnnAlpha: Protein disordered regions prediction by reduced amino acid alphabets and convolutional neural networks

Proteins. 2020 Nov;88(11):1472-1481. doi: 10.1002/prot.25966. Epub 2020 Aug 7.

Abstract

Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and the low proportion of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio, sequence-only prediction method, based on reduced amino acid alphabets and convolutional neural networks (CNNs), that aims to overcome these challenges. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as, or outperforms, other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction (CASP10) dataset. Our method is therefore suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
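To make the input reduction concrete, the following is a minimal sketch of how a protein sequence might be mapped to a 3-letter alphabet, one-hot encoded, and scanned by a single convolutional filter. The grouping used here (hydrophobic, polar, charged) is an illustrative assumption and is not claimed to be one of the six alphabets evaluated in the article; the function names `reduce_sequence`, `one_hot`, and `conv1d_same`, the example sequence, and the hand-set filter are likewise hypothetical, and no trained network is involved.

```python
import numpy as np

# ASSUMPTION: an illustrative 3-letter grouping of the 20 standard amino acids.
# It is NOT taken from the article, which does not list its alphabets here.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQGPYH",  # polar / small
    "C": "DEKR",      # charged
}
LETTERS = "HPC"
REDUCE = {aa: letter for letter, members in GROUPS.items() for aa in members}


def reduce_sequence(seq: str) -> str:
    """Map each residue to its reduced-alphabet letter."""
    return "".join(REDUCE[aa] for aa in seq.upper())


def one_hot(reduced: str) -> np.ndarray:
    """Encode a reduced sequence as an (L, 3) matrix instead of (L, 20)."""
    idx = [LETTERS.index(c) for c in reduced]
    out = np.zeros((len(reduced), len(LETTERS)), dtype=np.float32)
    out[np.arange(len(reduced)), idx] = 1.0
    return out


def conv1d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one (k, 3) filter along the sequence with zero padding (k odd)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.array([np.sum(padded[i:i + k] * kernel) for i in range(x.shape[0])])


# Hypothetical example sequence, chosen only for illustration.
sequence = "MKTAYIAKQR"
reduced = reduce_sequence(sequence)
encoded = one_hot(reduced)

# Hand-set filter that counts charged residues in a window of 3;
# in a real CNN the filter weights would be learned from data.
charged_filter = np.zeros((3, 3), dtype=np.float32)
charged_filter[:, LETTERS.index("C")] = 1.0
activations = conv1d_same(encoded, charged_filter)

print(reduced)
print(encoded.shape)
print(activations)
```

With a 3-letter input, each filter position has 3 weights rather than 20, which is one way to read the claim that the reduced alphabet simplifies the patterns the convolutional step has to detect.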

Keywords: convolutional neural networks; disordered proteins; machine learning.

MeSH terms

  • Amino Acid Sequence
  • Area Under Curve
  • Benchmarking
  • Computational Biology / methods*
  • Data Mining / statistics & numerical data*
  • Datasets as Topic
  • Humans
  • Intrinsically Disordered Proteins / chemistry*
  • Machine Learning*
  • Multifactor Dimensionality Reduction
  • Neural Networks, Computer*
  • ROC Curve
  • Sequence Analysis, Protein

Substances

  • Intrinsically Disordered Proteins