ngsComposer: an automated pipeline for empirically based NGS data quality filtering

Ryan D Kuster; G Craig Yencho; Bode A Olukolu

doi:10.1093/bib/bbab092

Brief Bioinform. 2021 Sep 2;22(5):bbab092. doi: 10.1093/bib/bbab092.

Authors

Ryan D Kuster¹, G Craig Yencho², Bode A Olukolu¹

Affiliations

¹ Department of Entomology and Plant Pathology, University of Tennessee, USA.
² Department of Horticultural Science, NC State University, USA.

Abstract

Next-generation sequencing (NGS) enables massively parallel acquisition of large-scale omics data; however, objective data quality filtering parameters are lacking. Although Phred values are a useful metric, evidence reveals that platform-generated Phred values overestimate per-base quality scores. We have developed novel and empirically based algorithms that streamline NGS data quality filtering. The pipeline leverages known sequence motifs to enable empirical estimation of error rates, detection of erroneous base calls and removal of contaminating adapter sequence. The performance of motif-based error detection and quality filtering was further validated with read compression rates as an unbiased metric. Elevated error rates at read ends, where known motifs lie, tracked with propagation of erroneous base calls. Barcode swapping, an inherent problem with pooled libraries, was also effectively mitigated. The ngsComposer pipeline is suitable for various NGS protocols and platforms due to the universal concepts on which the algorithms are based.
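
The idea of estimating error rates from known sequence motifs can be illustrated with a minimal sketch. This is not the ngsComposer implementation; the function names (`empirical_error_rates`, `phred_from_error`, `error_from_phred`), the example motif and the cap of 60 are assumptions made for illustration. The sketch counts mismatches against an expected motif at each position and converts between error probability and the standard Phred scale, Q = -10 log10(P), so that an empirical rate can be compared with the platform-reported score.

```python
import math


def empirical_error_rates(observed, expected):
    """Per-position mismatch fraction between observed segments and a known motif.

    `observed` is a list of read segments taken from the position where the
    motif is expected; segments of a different length are skipped.
    """
    mismatches = [0] * len(expected)
    total = 0
    for segment in observed:
        if len(segment) != len(expected):
            continue  # cannot be compared position by position
        total += 1
        for i, (obs_base, exp_base) in enumerate(zip(segment, expected)):
            if obs_base != exp_base:
                mismatches[i] += 1
    if total == 0:
        raise ValueError("no segments of the expected length")
    return [count / total for count in mismatches]


def phred_from_error(p, cap=60.0):
    """Phred-scaled quality for error probability p (capped; p == 0 maps to cap)."""
    if p <= 0:
        return cap
    return min(cap, -10.0 * math.log10(p))


def error_from_phred(q):
    """Error probability implied by a Phred score q."""
    return 10 ** (-q / 10)


# Hypothetical data: four read segments compared with the motif "TGCA".
rates = empirical_error_rates(["TGCA", "TGCA", "TGCT", "AGCA"], "TGCA")
empirical_q = [phred_from_error(r) for r in rates]
```

If the empirical quality at a position is well below the mean platform-reported Phred score there, the reported score overestimates quality at that position, which is the kind of discrepancy the abstract describes.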

Keywords: buffer sequence; demultiplexing; qRRS (quantitative reduced representation sequencing); quality score; short-read pre-processing; OmeSeq.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology / methods*
Computer Simulation
High-Throughput Nucleotide Sequencing / methods*
Humans
Reproducibility of Results
Sequence Analysis, DNA / methods*
Software*