Optimizing the Parametrization of Homologue Classification in the Pan-Genome Computation for a Bacterial Species: Case Study Streptococcus pyogenes

Methods Mol Biol. 2022:2449:299-324. doi: 10.1007/978-1-0716-2095-3_13.

Abstract

The paradigm shift associated with the introduction of the pan-genome concept has moved attention away from single reference genomes and toward the actual sequence diversity within groups such as organism populations, strain collections, and clades. A single genome is no longer sufficient to describe a bacterium of interest; instead, the genomic repertoire of all existing strains is the key to the metabolic, evolutionary, or pathogenic potential of a species. The classification of orthologous genes derived from a collection of taxonomically related genome sequences is central to the computational analysis of bacterial pan-genomes. In this work, we review methods for computing pan-genome gene clusters and compare them on Streptococcus pyogenes strain genomes. We exhaustively scanned the parameter space of the homologue search procedures and found optimal parameters for the orthologous clustering of gene sequences: a sequence identity threshold of 60% and a coverage threshold of 50-60% in the pairwise alignment. We also found that the sequence identity threshold influences the number of gene families roughly three times more strongly than the sequence coverage threshold.
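
To illustrate how identity and coverage thresholds shape the number of gene families, here is a minimal Python sketch. It is not the chapter's pipeline. The function name `cluster_genes`, the toy gene names and hit values, the definition of coverage as alignment length divided by the longer of the two sequences, and the single-linkage grouping via union-find are all assumptions made for illustration only.

```python
from collections import defaultdict


def cluster_genes(hits, lengths, min_identity=60.0, min_coverage=50.0):
    """Group genes into families from pairwise alignment hits.

    hits:    iterable of (query, subject, percent_identity, alignment_length)
    lengths: dict mapping gene name -> sequence length
    Coverage is ASSUMED to be alignment length over the longer sequence;
    the grouping is ASSUMED to be single linkage (any passing hit merges).
    """
    parent = {gene: gene for gene in lengths}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, identity, aln_len in hits:
        coverage = 100.0 * aln_len / max(lengths[query], lengths[subject])
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(query)] = find(subject)  # merge the two families

    families = defaultdict(set)
    for gene in lengths:
        families[find(gene)].add(gene)
    return sorted(sorted(members) for members in families.values())


# Toy data (hypothetical, not taken from the chapter).
lengths = {"a1": 300, "b1": 310, "c1": 150, "d1": 300}
hits = [
    ("a1", "b1", 92.0, 295),  # high identity, high coverage
    ("a1", "c1", 85.0, 120),  # high identity, coverage 40%
    ("b1", "d1", 45.0, 290),  # high coverage, identity 45%
]

families = cluster_genes(hits, lengths)
print(len(families), families)
```

With the default thresholds only the first hit passes, so the four toy genes fall into three families. Lowering `min_coverage` to 30 or `min_identity` to 40 each merges one more pair. Re-running such a function over a grid of both thresholds and counting the resulting families is one simple way to picture a scan of the parameter space.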

Keywords: Bacterial genome; Core genome; Orthologue; Pan-genome; Paralogue.

Publication types

  • Review

MeSH terms

  • Cluster Analysis
  • Genome, Bacterial*
  • Genomics / methods
  • Multigene Family
  • Phylogeny
  • Streptococcus pyogenes* / genetics