Pattern Recognition on Read Positioning in Next Generation Sequencing

Boseon Byeon; Igor Kovalchuk

doi:10.1371/journal.pone.0157033

PLoS One. 2016 Jun 14;11(6):e0157033. doi: 10.1371/journal.pone.0157033. eCollection 2016.

Authors

Boseon Byeon¹, Igor Kovalchuk¹

Affiliation

¹ Department of Biological Sciences, University of Lethbridge, Lethbridge, Alberta, T1K 3M4, Canada.

Abstract

The utility of next generation sequencing (NGS) technology rests on the assumption that the DNA or cDNA cleavage required to generate short sequence reads is random. Several previous reports suggest that NGS reads are subject to sequencing bias. To examine this bias in greater detail, we analyze NGS data from four organisms with different GC content: Plasmodium falciparum (19.39%), Arabidopsis thaliana (36.03%), Homo sapiens (40.91%), and Streptomyces coelicolor (72.00%). Using machine learning techniques, we recognize a pattern in which NGS read starts are positioned in local regions whose nucleotide distribution is dissimilar from the global nucleotide distribution. We also demonstrate that the mono-nucleotide distribution underestimates sequencing bias, and that the recognized pattern is explained largely by the distribution of multi-nucleotides (di-, tri-, and tetra-nucleotides) rather than mono-nucleotides. This implies that sequencing bias should be corrected on the basis of the multi-nucleotide distribution. We provide companion software that quantifies the effect of the recognized pattern on read positioning, and we show by example that bias correction based on the mono-nucleotide distribution may not be sufficient to remove sequencing bias.
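The idea of comparing a local nucleotide distribution around a read start with the global distribution can be illustrated with a minimal Python sketch. This is not the authors' companion software or their machine learning method: the function names, the window size (`flank`), and the use of Jensen-Shannon divergence as the dissimilarity measure are all assumptions chosen for illustration. Setting `k` to 1, 2, 3, or 4 switches between mono-, di-, tri-, and tetra-nucleotide distributions.

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumption: this measure stands in for whatever dissimilarity
    the original study used; it ranges from 0 (identical) to 1 (disjoint).
    """
    keys = set(p) | set(q)
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[key] * log2(a[key] / m[key]) for key in a if a[key] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def local_dissimilarity(genome, read_start, k=1, flank=10):
    """Dissimilarity between the k-mer distribution in a window around
    read_start and the k-mer distribution of the whole sequence.

    `flank` (hypothetical parameter) is the number of bases taken on
    each side of the read start.
    """
    lo = max(0, read_start - flank)
    hi = min(len(genome), read_start + flank)
    local = kmer_distribution(genome[lo:hi], k)
    global_ = kmer_distribution(genome, k)
    return js_divergence(local, global_)


if __name__ == "__main__":
    # Toy sequence: balanced repeats with a poly-A stretch in the middle.
    genome = "ACGT" * 50 + "A" * 20 + "ACGT" * 50
    for k in (1, 2):
        balanced = local_dissimilarity(genome, 100, k=k)
        skewed = local_dissimilarity(genome, 210, k=k)
        print(f"k={k}: balanced window {balanced:.3f}, poly-A window {skewed:.3f}")
```

In this toy sequence the window centred on the poly-A stretch scores higher than a window in the balanced repeats. Applying the same score to real read start positions, and comparing `k=1` with larger `k`, would be one way to explore the mono- versus multi-nucleotide contrast described in the abstract.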

MeSH terms

Arabidopsis / genetics
Base Composition
Base Sequence
DNA / analysis
DNA / genetics*
High-Throughput Nucleotide Sequencing / methods*
Humans
Machine Learning
Pattern Recognition, Automated / methods*
Plasmodium falciparum / genetics
Sequence Analysis, DNA / methods*
Software
Streptomyces coelicolor / genetics

Substances

DNA

Grants and funding

This work was funded by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant to IK and an NSERC CREATE grant to BB.