MetaSMC: a coalescent-based shotgun sequence simulator for evolving microbial populations

Bioinformatics. 2019 May 15;35(10):1677-1685. doi: 10.1093/bioinformatics/bty840.

Abstract

Motivation: High-throughput sequencing technology has revolutionized the study of metagenomics and cancer evolution. In a relatively simple environment, a metagenomics sequencing data is dominated by a few species. By analyzing the alignment of reads from microbial species, single nucleotide polymorphisms can be discovered and the evolutionary history of the populations can be reconstructed. The ever-increasing read length will allow more detailed analysis about the evolutionary history of microbial or tumor cell population. A simulator of shotgun sequences from such populations will be helpful in the development or evaluation of analysis algorithms.

Results: Here, we described an efficient algorithm, MetaSMC, which simulates reads from evolving microbial populations. Based on the coalescent theory, our simulator supports all evolutionary scenarios supported by other coalescent simulators. In addition, the simulator supports various substitution models, including Jukes-Cantor, HKY85 and generalized time-reversible models. The simulator also supports mutator phenotypes by allowing different mutation rates and substitution models in different subpopulations. Our algorithm ignores unnecessary chromosomal segments and thus is more efficient than standard coalescent when recombination is frequent. We showed that the process behind our algorithm is equivalent to Sequentially Markov Coalescent with an incomplete sample. The accuracy of our algorithm was evaluated by summary statistics and likelihood curves derived from Monte Carlo integration over large number of random genealogies.

Availability and implementation: MetaSMC is written in C. The source code is available at https://github.com/tarjxvf/metasmc.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence
  • Genetics, Population*
  • High-Throughput Nucleotide Sequencing
  • Metagenomics
  • Sequence Analysis, DNA
  • Software*