Partial order relation-based gene ontology embedding improves protein function prediction

Brief Bioinform. 2024 Jan 22;25(2):bbae077. doi: 10.1093/bib/bbae077.

Abstract

Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO-directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.

Keywords: Gene Ontology; partial order constraint; protein annotation; protein function prediction; representation learning.

MeSH terms

  • Benchmarking*
  • Computational Biology*
  • Gene Ontology
  • Learning
  • Molecular Sequence Annotation