Statistical guidelines for quality control of next-generation sequencing techniques

Life Sci Alliance. 2021 Aug 30;4(11):e202101113. doi: 10.26508/lsa.202101113. Print 2021 Nov.

Abstract

More and more next-generation sequencing (NGS) data are made available every day. However, the quality of these data is not always guaranteed. Available quality control tools require profound knowledge to correctly interpret the multiplicity of quality features. Moreover, it is usually difficult to know whether quality features are relevant in all experimental conditions. Therefore, the NGS community would benefit greatly from condition-specific data-driven guidelines derived from many publicly available experiments, which reflect routinely generated NGS data. In this work, we have characterized well-known quality guidelines and related features in large datasets and concluded that they are too limited for assessing the quality of a given NGS file accurately. Therefore, we present new data-driven guidelines derived from the statistical analysis of many public datasets using quality features calculated by common bioinformatics tools. Thanks to this approach, we confirm the high relevance of genome mapping statistics for assessing the quality of the data, and we demonstrate the limited scope of some quality features that are not relevant in all conditions. Our guidelines are available at https://cbdm.uni-mainz.de/ngs-guidelines.

MeSH terms

  • Computational Biology / methods
  • Genome, Human
  • High-Throughput Nucleotide Sequencing / methods*
  • High-Throughput Nucleotide Sequencing / standards*
  • Humans
  • Quality Control
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, DNA / statistics & numerical data
  • Software