Gene function prediction in five model eukaryotes exclusively based on gene relative location through machine learning

Sci Rep. 2022 Jul 8;12(1):11655. doi: 10.1038/s41598-022-15329-w.

Abstract

The function of most genes is unknown. The best results in automated function prediction are obtained with machine learning-based methods that combine multiple data sources, typically sequence derived features, protein structure and interaction data. Even though there is ample evidence showing that a gene's function is not independent of its location, the few available examples of gene function prediction based on gene location rely on sequence identity between genes of different organisms and are thus subjected to the limitations of the relationship between sequence and function. Here we predict thousands of gene functions in five model eukaryotes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens) using machine learning models exclusively trained with features derived from the location of genes in the genomes to which they belong. Our aim was not to obtain the best performing method to automated function prediction but to explore the extent to which a gene's location can predict its function in eukaryotes. We found that our models outperform BLAST when predicting terms from Biological Process and Cellular Component Ontologies, showing that, at least in some cases, gene location alone can be more useful than sequence to infer gene function.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Caenorhabditis elegans / genetics
  • Caenorhabditis elegans / metabolism
  • Drosophila melanogaster* / genetics
  • Genome
  • Machine Learning*
  • Mice
  • Phenotype
  • Saccharomyces cerevisiae / genetics