Accurate and fast graph-based pangenome annotation and clustering with ggCaller

Genome Res. 2023 Sep;33(9):1622-1637. doi: 10.1101/gr.277733.123. Epub 2023 Aug 24.

Abstract

Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population, which together make up the "pangenome," must be predicted, functionally annotated, and clustered. Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently yields inconsistent results across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller achieves considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology.
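
As an illustration of why a population-wide de Bruijn graph removes redundancy, the toy sketch below stores each k-mer once and tags it with the genomes ("colours") that contain it. It is not ggCaller's implementation: the function name build_coloured_dbg, the example sequences, and the choice of k = 4 are all hypothetical, and the sketch ignores reverse complements and real assembly data.

```python
from collections import defaultdict


def build_coloured_dbg(genomes, k):
    """Toy coloured de Bruijn graph (hypothetical helper, forward strand only).

    Returns:
      colours: k-mer -> set of genome names containing that k-mer
      edges:   k-mer -> set of k-mers that directly follow it in some genome
    """
    colours = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            colours[kmer].add(name)
        # Consecutive k-mers overlap by k-1 bases, which defines an edge.
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return dict(colours), dict(edges)


# Made-up sequences: A and B are identical, C differs by one base.
genomes = {
    "genome_A": "ATGGCGTGA",
    "genome_B": "ATGGCGTGA",
    "genome_C": "ATGGCTTGA",
}
K = 4
colours, edges = build_coloured_dbg(genomes, K)

# Number of k-mer occurrences if every genome were processed in isolation.
total_kmers = sum(len(seq) - K + 1 for seq in genomes.values())

print(total_kmers, "k-mer occurrences stored as", len(colours), "graph nodes")
print("ATGG is present in:", sorted(colours["ATGG"]))
print("TGGC branches to:", sorted(edges["TGGC"]))
```

In this toy case, sequence shared by all three genomes is represented once, so any analysis of that region would only need to run once and could then be propagated to every genome carrying the same colours. The single-base difference in genome_C shows up as a branch in the graph rather than as a separate copy of the whole sequence.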

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Anti-Bacterial Agents*
  • Cluster Analysis
  • Drug Resistance, Bacterial
  • Genome, Bacterial
  • Genome-Wide Association Study*
  • Macrolides
  • Molecular Sequence Annotation
  • Software

Substances

  • Anti-Bacterial Agents
  • Macrolides