AmpliCI: a high-resolution model-based approach for denoising Illumina amplicon data

Bioinformatics. 2021 Jan 29;36(21):5151-5158. doi: 10.1093/bioinformatics/btaa648.

Abstract

Motivation: Next-generation amplicon sequencing is a powerful tool for investigating microbial communities. A main challenge is to distinguish true biological variants from errors caused by amplification and sequencing. In traditional analyses, such errors are eliminated by clustering reads within a sequence similarity threshold, usually 97%, and constructing operational taxonomic units, but the arbitrary threshold leads to low resolution and high false-positive rates. Recently developed 'denoising' methods have proven able to resolve single-nucleotide amplicon variants, but they still miss low-frequency sequences, especially those near more frequent sequences, because they ignore the sequencing quality information.

Results: We introduce AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance and identity of error-free sequences in massive Illumina amplicon datasets. AmpliCI considers the quality information and allows the data, not an arbitrary threshold or an external database, to drive conclusions. AmpliCI estimates a finite mixture model, using a greedy strategy to gradually select error-free sequences and approximately maximize the likelihood. AmpliCI has better performance than three popular denoising methods, with acceptable computation time and memory usage.
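The idea of fitting a quality-aware finite mixture greedily can be illustrated with a toy sketch. This is not AmpliCI's implementation. It assumes equal-length reads with substitution errors only, a uniform choice among the three wrong bases, hard-assignment mixture weights with add-one smoothing, and a BIC-like penalty of log(n) per selected sequence. All function names and the penalty are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def phred_to_error(q):
    """Convert a Phred score to an error probability (capped so log(1 - p) is defined)."""
    return min(10 ** (-q / 10), 0.75)

def read_loglik(seq, quals, hap):
    """Log-probability of observing `seq` if `hap` is the true sequence (substitutions only)."""
    total = 0.0
    for base, q, true_base in zip(seq, quals, hap):
        p = phred_to_error(q)
        total += math.log(1 - p) if base == true_base else math.log(p / 3)
    return total

def estimate_weights(reads, haps):
    """Assumed shortcut: assign each read to its best haplotype, with add-one smoothing."""
    counts = [1.0] * len(haps)
    for seq, quals in reads:
        best = max(range(len(haps)), key=lambda k: read_loglik(seq, quals, haps[k]))
        counts[best] += 1
    total = sum(counts)
    return [c / total for c in counts]

def penalized_loglik(reads, haps, penalty):
    """Mixture log-likelihood minus a per-haplotype penalty (hypothetical, BIC-like)."""
    weights = estimate_weights(reads, haps)
    total = 0.0
    for seq, quals in reads:
        terms = [math.log(w) + read_loglik(seq, quals, h) for h, w in zip(haps, weights)]
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total - penalty * len(haps)

def greedy_select(reads, penalty=None):
    """Add candidates in order of abundance, keeping each only if the score improves."""
    if penalty is None:
        penalty = math.log(len(reads))
    candidates = [seq for seq, _ in Counter(seq for seq, _ in reads).most_common()]
    selected = [candidates[0]]
    best = penalized_loglik(reads, selected, penalty)
    for cand in candidates[1:]:
        trial = selected + [cand]
        score = penalized_loglik(reads, trial, penalty)
        if score > best:
            selected, best = trial, score
    return selected

# Toy data: a dominant sequence, a well-supported one-base variant,
# and a singleton whose only difference sits at a very low-quality base.
reads = (
    [("ACGT", [40, 40, 40, 40])] * 20
    + [("ACGA", [40, 40, 40, 40])] * 5
    + [("TCGT", [2, 40, 40, 40])]
)
selected = greedy_select(reads)
print(selected)
```

In this toy input, the high-quality variant ACGA is kept as a separate sequence, while the singleton TCGT is explained as a likely error of ACGT because its mismatch falls on a base with a very low quality score. An abundance-only rule would have no way to tell these two cases apart by quality.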

Availability and implementation: Source code is available at https://github.com/DormanLab/AmpliCI.

Supplementary information: Supplementary material is available at Bioinformatics online.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • High-Throughput Nucleotide Sequencing
  • Microbiota*
  • Sequence Analysis, DNA
  • Software