Analyzing genome coverage profiles with applications to quality control in metagenomics

Bioinformatics. 2013 May 15;29(10):1260-7. doi: 10.1093/bioinformatics/btt147. Epub 2013 Apr 14.

Abstract

Motivation: Genome coverage, the number of sequencing reads mapped to a position in a genome, is an insightful indicator of irregularities within sequencing experiments. While the average genome coverage is frequently used within algorithms in computational genomics, the complete information available in coverage profiles (i.e. histograms over all coverages) is currently not exploited to its full extent. Thus, biases such as fragmented or erroneous reference genomes often remain unaccounted for. Making this information accessible can improve the quality of sequencing experiments and quantitative analyses.
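To make the terms above concrete, the following is a minimal sketch of how a per-position coverage vector and its coverage profile (a histogram over coverage values) might be computed. The function names `per_base_coverage` and `coverage_profile`, the representation of reads as 0-based half-open `(start, end)` intervals, and the toy data are assumptions for illustration; they are not taken from the fitGCP code.

```python
from collections import Counter


def per_base_coverage(genome_length, reads):
    """Return the number of reads covering each genome position.

    `reads` is an iterable of (start, end) intervals, 0-based and
    half-open (an assumed representation, chosen for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(0, start)
        end = min(genome_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum over the difference array yields the coverage.
    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def coverage_profile(coverage):
    """Histogram over coverages: coverage value -> number of positions."""
    return dict(sorted(Counter(coverage).items()))


# Toy example: three reads on a genome of length 10.
reads = [(0, 4), (2, 6), (3, 8)]
cov = per_base_coverage(10, reads)
profile = coverage_profile(cov)
mean_coverage = sum(cov) / len(cov)
```

The average coverage reduces the whole vector to a single number, whereas the profile keeps the information that, in this toy case, two positions are not covered at all.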

Results: We introduce a framework for fitting mixtures of probability distributions to genome coverage profiles. Besides commonly used distributions, we introduce distributions tailored to account for common artifacts. The mixture models are iteratively fitted based on the Expectation-Maximization algorithm. We introduce use cases with a focus on metagenomics and develop new analysis strategies to assess the validity of a reference genome with respect to (meta-)genomic read data. The framework is evaluated on simulated data as well as applied to a large-scale metagenomic study, for which we compute the validity of 75 microbial genomes. The results indicate that the choice and quality of reference genomes are vital for metagenomic analyses and that validation of coverage profiles is crucial to avoid incorrect conclusions.
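As an illustration of the general idea of fitting a mixture to a coverage profile with Expectation-Maximization, the sketch below fits a mixture of two Poisson components. It is a simplified stand-in, not the framework described above: the restriction to Poisson components, the function names `poisson_pmf` and `fit_poisson_mixture`, the starting values, and the synthetic profile are all assumptions for this example, and the artifact-specific distributions mentioned in the abstract are not modeled.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing coverage k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_poisson_mixture(profile, lambdas, weights, n_iter=200):
    """Fit a Poisson mixture to a coverage profile with EM.

    `profile` maps coverage value -> number of positions with that
    coverage; `lambdas` and `weights` are the starting parameters.
    """
    values = list(profile.keys())
    counts = [profile[v] for v in values]
    total = sum(counts)
    lambdas = list(lambdas)
    weights = list(weights)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each coverage value.
        resp = []
        for v in values:
            joint = [w * poisson_pmf(v, lam) for w, lam in zip(weights, lambdas)]
            norm = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / norm for j in joint])

        # M-step: re-estimate weights and means, weighting by histogram counts.
        for j in range(len(lambdas)):
            mass = sum(c * r[j] for c, r in zip(counts, resp))
            weighted_sum = sum(c * v * r[j] for c, v, r in zip(counts, values, resp))
            weights[j] = mass / total
            if mass > 0:
                lambdas[j] = max(weighted_sum / mass, 1e-6)

    return weights, lambdas


# Synthetic profile (assumed data): 30% of positions at low coverage,
# 70% at high coverage, over 100000 positions.
true_weights, true_lambdas = [0.3, 0.7], [2.0, 30.0]
profile = {
    k: round(100000 * sum(w * poisson_pmf(k, lam)
                          for w, lam in zip(true_weights, true_lambdas)))
    for k in range(80)
}

weights, lambdas = fit_poisson_mixture(profile, lambdas=[1.0, 20.0], weights=[0.5, 0.5])
```

On this synthetic input the fitted weights and means should land close to the values used to generate the profile. Working on the histogram rather than on individual positions keeps each EM iteration proportional to the number of distinct coverage values, not to the genome length.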

Availability: The code is freely available and can be downloaded from http://sourceforge.net/projects/fitgcp/.

Contact: RenardB@rki.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Bacteria / classification*
  • Bacteria / genetics
  • Bacteria / isolation & purification
  • Gastrointestinal Tract / microbiology
  • Genome
  • Genome, Bacterial
  • Humans
  • Metagenomics*
  • Probability
  • Sequence Analysis, DNA / methods