Near perfect protein multi-label classification with deep neural networks

Methods. 2018 Jan 1:132:50-56. doi: 10.1016/j.ymeth.2017.06.034. Epub 2017 Jul 3.

Abstract

Biological sequences can be regarded as high-dimensional data items whose dimension is not fixed but varies with the length of the sequence. Comparing and classifying biological sequences against large databases are important areas of current research. Artificial neural networks (ANNs) have gained well-deserved popularity among machine learning tools following their recent successes in image and sound processing and in classification problems. ANNs have also been applied to predicting the family or function of a protein from its residue sequence. Here we present two new ANNs with multi-label classification ability that show impressive accuracy when classifying protein sequences into 698 UniProt families (AUC=99.99%) and 983 Gene Ontology classes (AUC=99.45%).
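
To make two ideas from the abstract concrete, the following minimal Python sketch shows one common way to turn a variable-length residue sequence into a fixed-size input, and one way to compute an AUC over multi-label predictions. The abstract specifies neither the input encoding nor how the AUC was averaged, so everything here is an assumption for illustration: the one-hot encoding with zero padding, the micro-averaging over all (sequence, class) pairs, and the names `one_hot_pad`, `auc`, and `micro_auc` are hypothetical and are not the authors' method.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_pad(seq: str, max_len: int) -> List[List[int]]:
    """Encode a sequence as a max_len x 20 one-hot matrix (assumed encoding).

    Sequences longer than max_len are truncated, shorter ones are
    zero-padded, and non-standard residues become all-zero rows.
    """
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq):
            idx = AA_INDEX.get(seq[pos].upper())
            if idx is not None:
                row[idx] = 1
        rows.append(row)
    return rows


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_auc(y_true: Sequence[Sequence[int]],
              y_score: Sequence[Sequence[float]]) -> float:
    """Micro-averaged AUC (assumed averaging): pool every
    (sequence, class) pair into a single ranking."""
    flat_true = [y for row in y_true for y in row]
    flat_score = [s for row in y_score for s in row]
    return auc(flat_true, flat_score)


if __name__ == "__main__":
    # Toy data: 2 sequences, 3 classes; each sequence may carry several labels.
    encoded = one_hot_pad("MKV", max_len=5)
    print(len(encoded), len(encoded[0]))
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_score = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]]
    print(micro_auc(y_true, y_score))
```

The pairwise AUC above is quadratic in the number of pooled pairs, which is adequate for a toy example; with hundreds of classes and many sequences, a rank-based implementation from an established library would be the more practical choice.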

MeSH terms

  • Algorithms
  • Area Under Curve
  • Gene Ontology
  • Molecular Sequence Annotation
  • Neural Networks, Computer
  • Proteins / genetics*
  • Proteins / metabolism
  • Proteogenomics / methods*

Substances

  • Proteins