Sequencing error profiles of Illumina sequencing instruments

Nicholas Stoler; Anton Nekrutenko

doi:10.1093/nargab/lqab019

NAR Genom Bioinform. 2021 Mar 27;3(1):lqab019. doi: 10.1093/nargab/lqab019. eCollection 2021 Mar.

Authors

Nicholas Stoler¹, Anton Nekrutenko²

Affiliations

¹ Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.
² Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.

Abstract

Sequencing technology has advanced greatly in the past decade. Previous studies have characterized the quality of specific instruments under controlled conditions. Here, we developed a method that can retroactively determine the error rate of most public sequencing datasets. To do this, we used the overlaps between reads that are a feature of many sequencing libraries. With this method, we surveyed 1943 different datasets from seven different sequencing instruments produced by Illumina. We show that among public datasets, the more expensive platforms like HiSeq and NovaSeq have a lower error rate and less variation. However, we also found great variation within each platform, with the accuracy of a sequencing experiment depending greatly on the experimenter. We show the importance of sequence context, especially the phenomenon where preceding bases bias the following bases toward the same identity. We also show how patterns of sequence bias differ between instruments. Contrary to expectations based on the underlying chemistry, HiSeq X Ten and NovaSeq 6000 share notable exceptions to the preceding-base bias. Our results demonstrate the importance of the specific circumstances of every sequencing experiment, and the importance of evaluating the quality of each one.
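
The overlap idea can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the overlapping reads are paired-end mates sequenced from opposite ends of one fragment, that the overlap length is already known, and that there are no indels. The function names (`reverse_complement`, `overlap_mismatches`) are hypothetical. Both mates report the same fragment positions inside the overlap, so any disagreement there means at least one of the two base calls is wrong.

```python
# Illustrative sketch only: assumes a known overlap length and no indels.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-case ACGTN)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(read1, read2, overlap_len):
    """Count disagreements between two mates inside their overlap.

    read1 is the forward mate; read2 is the reverse mate as sequenced.
    The last `overlap_len` bases of read1 are assumed to cover the same
    fragment positions as the first `overlap_len` bases of the
    reverse-complemented read2.
    Returns (mismatches, compared); positions with an N are skipped.
    """
    if overlap_len <= 0:
        return 0, 0
    tail = read1[-overlap_len:]
    head = reverse_complement(read2)[:overlap_len]
    mismatches = compared = 0
    for a, b in zip(tail, head):
        if a == "N" or b == "N":
            continue  # no-calls carry no information about errors
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches, compared


# Usage: a 16-base fragment read from both ends with 10-base reads,
# which leaves a 4-base overlap in the middle.
fragment = "ACGTACGGTCATTGCA"
read1 = fragment[:10]
read2 = reverse_complement(fragment[6:])
print(overlap_mismatches(read1, read2, 4))
```

Dividing the mismatch count by the number of compared positions gives a per-position disagreement rate for the pair. This is a proxy for the error rate rather than the rate itself, since a mismatch does not reveal which mate is wrong.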
