NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information

Nucleic Acids Res. 2021 Jul 2;49(W1):W469-W475. doi: 10.1093/nar/gkab398.

Abstract

With the explosive growth of protein sequences, large-scale automated protein function prediction (AFP) is becoming increasingly challenging. A protein is usually associated with dozens of Gene Ontology (GO) terms, so AFP is regarded as a large-scale multi-label classification problem. Under the learning to rank (LTR) framework, our previous tool, NetGO, integrated massive network information and multiple types of protein sequence information, and achieved good performance while handling all possible GO terms (>44 000). In this work, we propose an updated version, NetGO 2.0, which further improves the performance of large-scale AFP. NetGO 2.0 additionally incorporates literature information via logistic regression and deep sequence information via a recurrent neural network (RNN) into the framework. We generated datasets following the Critical Assessment of Functional Annotation (CAFA) protocol. Experimental results show that NetGO 2.0 significantly outperformed NetGO in the biological process ontology (BPO) and cellular component ontology (CCO). In particular, NetGO 2.0 achieved a 12.6% improvement over NetGO in terms of area under the precision-recall curve (AUPR) in BPO and around 2.6% in terms of Fmax in CCO. These results demonstrate the benefits of incorporating text and deep sequence information for functional annotation in BPO and CCO. The NetGO 2.0 web server is freely available at http://issubmission.sjtu.edu.cn/ng2/.
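
For readers unfamiliar with the Fmax metric mentioned above, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the best resulting F1. It is not the NetGO 2.0 evaluation code. The function name `f_max`, the data layout, the evenly spaced threshold grid and the toy GO identifiers are all assumptions made for illustration.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Protein-centric F-max over evenly spaced thresholds in (0, 1].

    predictions: protein -> {GO term -> score in [0, 1]} (hypothetical layout)
    truth:       protein -> non-empty set of true GO terms
    """
    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []      # only proteins with >= 1 prediction at threshold t
        recall_sum = 0.0     # accumulated over all benchmark proteins
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {go for go, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data with made-up GO identifiers, for illustration only.
predictions = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4},
    "P2": {"GO:3": 0.8},
}
truth = {
    "P1": {"GO:1"},
    "P2": {"GO:3", "GO:4"},
}
print(round(f_max(predictions, truth), 3))  # 0.857
```

In the toy data, the best trade-off occurs at thresholds between 0.41 and 0.80, where average precision is 1.0 and average recall is 0.75, giving F1 = 6/7. A real evaluation would also propagate annotations up the GO hierarchy before scoring, which this sketch omits.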

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • CCAAT-Binding Factor / chemistry
  • CCAAT-Binding Factor / metabolism
  • Caenorhabditis elegans Proteins / chemistry
  • Caenorhabditis elegans Proteins / metabolism
  • High-Throughput Nucleotide Sequencing
  • Neural Networks, Computer
  • Protein Domains
  • Proteins / classification
  • Proteins / metabolism
  • Proteins / physiology*
  • Sequence Analysis, Protein
  • Software*

Substances

  • CCAAT-Binding Factor
  • Caenorhabditis elegans Proteins
  • Proteins