Separation and assembly of deep sequencing data into discrete sub-population genomes

Nucleic Acids Res. 2017 Nov 2;45(19):10989-11003. doi: 10.1093/nar/gkx755.

Abstract

Sequence heterogeneity is a common characteristic of RNA viruses; the resulting variants are often referred to as sub-populations or quasispecies. Traditional techniques for assembling the short sequence reads produced by deep sequencing, such as de novo assemblers, ignore this underlying diversity. Here, we introduce a novel algorithm that simultaneously assembles the discrete sequences of multiple genomes present in a population. Using in silico data, we detected sub-populations at frequencies as low as 0.1% with complete global genome reconstruction, and in a single sample we resolved 16 sequences with no mismatches. We also applied the algorithm to high-throughput sequencing data obtained for viruses present in sewage samples and successfully detected multiple sub-populations and recombination events in these diverse mixtures. The algorithm's high sensitivity also enables genomic analysis of heterogeneous pathogen genomes from patient samples and accurate detection of intra-host diversity. This supports not only basic research in personalized medicine but also accurate diagnostics and monitoring of drug therapies, which are critical in clinical and regulatory decision-making.

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Genome, Human / genetics*
  • Genome, Viral / genetics
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Phylogeny
  • Poliovirus / classification
  • Poliovirus / genetics
  • Reproducibility of Results