A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies

Fernando Puente-Sánchez; Jacobo Aguirre; Víctor Parro

doi:10.1093/nar/gkv1113

Nucleic Acids Res. 2016 Feb 29;44(4):e40. doi: 10.1093/nar/gkv1113. Epub 2015 Nov 8.

Authors

Fernando Puente-Sánchez¹, Jacobo Aguirre², Víctor Parro³

Affiliations

¹ Department of Molecular Evolution, Centro de Astrobiología (INTA-CSIC). Instituto Nacional de Técnica Aeroespacial, Ctra de Torrejón a Ajalvir km 4. 28850 Torrejón de Ardoz, Madrid, Spain. Email: fpusan@gmail.com
² Department of Molecular Evolution, Centro de Astrobiología (INTA-CSIC). Instituto Nacional de Técnica Aeroespacial, Ctra de Torrejón a Ajalvir km 4. 28850 Torrejón de Ardoz, Madrid, Spain; Centro Nacional de Biotecnología (CSIC). c/ Darwin 3, 28049 Madrid, Spain; Grupo Interdisciplinar de Sistemas Complejos (GISC), Madrid, Spain.
³ Department of Molecular Evolution, Centro de Astrobiología (INTA-CSIC). Instituto Nacional de Técnica Aeroespacial, Ctra de Torrejón a Ajalvir km 4. 28850 Torrejón de Ardoz, Madrid, Spain.

Abstract

Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate reconstructions of the original biological sequences. Herein we present the Poisson binomial filtering algorithm (PBF), which minimizes both problems by calculating the error-probability distribution of a sequence from its quality scores. To validate our method, we quality-filtered 37 publicly available datasets obtained by sequencing mock and environmental microbial communities with the Roche 454, Illumina MiSeq and IonTorrent PGM platforms, and compared our results to those obtained with previous approaches such as the ones included in mothur, QIIME and USEARCH. Our algorithm retained substantially more reads than its predecessors, while yielding fewer and more accurate OTUs. This improved sensitivity produced more faithful representations, both quantitatively and qualitatively, of the true microbial diversity present in the studied samples. Furthermore, the method introduced in this work is computationally inexpensive and can be readily applied in conjunction with any existing analysis pipeline.
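
The idea of deriving an error-probability distribution from quality scores can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes Phred-scaled quality scores (per-base error probability 10^(-Q/10)) and independent per-base errors, so that the number of errors in a read follows a Poisson binomial distribution. The function names, the `max_errors` and `alpha` parameters, and the keep/discard rule are hypothetical choices made for illustration only.

```python
from typing import List


def phred_to_error_probs(quality_scores: List[int]) -> List[float]:
    """Convert Phred quality scores to per-base error probabilities."""
    return [10 ** (-q / 10) for q in quality_scores]


def poisson_binomial_pmf(probs: List[float]) -> List[float]:
    """Return P(k errors) for k = 0..len(probs).

    Builds the distribution one base at a time: each base either
    adds an error (probability p) or does not (probability 1 - p).
    """
    pmf = [1.0]  # with zero bases, zero errors is certain
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this base is correct
            nxt[k + 1] += mass * p       # this base is an error
        pmf = nxt
    return pmf


def keep_read(quality_scores: List[int],
              max_errors: int = 0,
              alpha: float = 0.05) -> bool:
    """Hypothetical filter rule (an assumption, not from the paper):
    keep the read if P(errors > max_errors) <= alpha."""
    pmf = poisson_binomial_pmf(phred_to_error_probs(quality_scores))
    prob_at_most = sum(pmf[: max_errors + 1])
    return (1.0 - prob_at_most) <= alpha


if __name__ == "__main__":
    high_quality = [40] * 100   # each base has error probability 1e-4
    low_quality = [10] * 100    # each base has error probability 0.1
    print(keep_read(high_quality))
    print(keep_read(low_quality))
```

Under these assumptions, a read is judged on the whole distribution of its possible error counts, not on an average or minimum quality score. The dynamic-programming loop costs on the order of the squared read length per read.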

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Bacteria / genetics*
Biodiversity
Computational Biology / methods*
High-Throughput Nucleotide Sequencing / methods*
Quality Control*
Sequence Analysis, DNA / methods