Genomic GC content prediction in prokaryotes from a sample of genes

Gene. 2005 Sep 12;357(2):137-43. doi: 10.1016/j.gene.2005.06.030.

Abstract

GC level is a key feature of prokaryotic genomes. Although it is widely employed in evolutionary studies, new insights are limited by the relatively low number of characterized genomes. Since public databases mainly comprise several hundred prokaryotes with a low number of sequences per genome, a reliable prediction method based on the available sequences may be useful for studies that need a trustworthy estimate of whole-genome GC content. Because the analysis of completely sequenced genomes shows great variability in distributional shapes, it is of interest to compare different estimators. Our analysis shows that the mean of the GC values of a random sample of genes is a reasonable estimator, given its simplicity of calculation and overall performance. However, available sequences usually come from a process that cannot be considered random sampling. When we analyzed two introduced sources of bias (gene length and protein functional categories), we detected an additional bias in the estimation in some cases, although the precision was not affected. We conclude that the mean genic GC level of a sample of 10 genes is a reliable estimator of genomic GC content, with accuracy comparable to that of many widely employed experimental methods.
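
The estimator named in the abstract, the mean of the GC values of a random sample of genes, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names gc_fraction and estimate_genomic_gc, the toy sequences, and the decision to skip ambiguous bases such as N are all assumptions made for the example.

```python
import random
from statistics import mean


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def estimate_genomic_gc(genes, sample_size=10, rng=None):
    """Unweighted mean of per-gene GC fractions over a random sample of genes.

    genes is a list of gene sequences (hypothetical input format).
    sample_size defaults to 10, the sample size discussed in the abstract.
    """
    if rng is None:
        rng = random.Random()
    if len(genes) < sample_size:
        raise ValueError("fewer genes available than the requested sample size")
    sample = rng.sample(genes, sample_size)  # sampling without replacement
    return mean(gc_fraction(g) for g in sample)


# Toy usage with made-up sequences (not real genes).
toy_genes = ["ATGGCGCGTTAA", "ATGAAATTTTAA", "ATGCCCGGGTGA"] * 4
estimate = estimate_genomic_gc(toy_genes, sample_size=10, rng=random.Random(1))
print(f"estimated genomic GC: {estimate:.3f}")
```

The sketch averages per-gene GC fractions without weighting by gene length, which matches the "mean of GC values" wording; a length-weighted variant would be a different estimator. It also assumes truly random sampling, which the abstract notes is usually not how available sequences are obtained.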

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Composition / genetics
  • Computational Biology / methods
  • Genome*
  • Models, Genetic*
  • Prokaryotic Cells / physiology*
  • Sequence Analysis, DNA* / methods