Measuring quality of DNA sequence data via degradation

PLoS One. 2022 Aug 3;17(8):e0271970. doi: 10.1371/journal.pone.0271970. eCollection 2022.

Abstract

We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
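As a rough illustration of the idea, the sketch below degrades a sequence by random base substitution, measures how far its k-mer profile moves, and flags genomes whose measured effect is unusual. The abstract does not specify the degradation mechanism, the quality measure, or the outlier rule, so all of these are assumptions made for illustration: the substitution model, the k-mer total-variation distance, the median/MAD cutoff, and every function name and parameter (`degrade`, `degradation_effect`, `flag_outliers`, `rate`, `z_cut`) are hypothetical and are not the authors' method.

```python
import random
import statistics
from collections import Counter

BASES = "ACGT"

def degrade(seq, rate, rng):
    """Replace each base with a different base with probability `rate` (assumed model)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def kmer_profile(seq, k=3):
    """Relative frequency of each k-mer in `seq`."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def profile_distance(p, q):
    """Total-variation distance between two k-mer profiles."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

def degradation_effect(seq, rate=0.05, k=3, trials=20, seed=0):
    """Mean profile shift caused by degrading `seq`, averaged over `trials` runs."""
    rng = random.Random(seed)
    original = kmer_profile(seq, k)
    shifts = [
        profile_distance(original, kmer_profile(degrade(seq, rate, rng), k))
        for _ in range(trials)
    ]
    return sum(shifts) / len(shifts)

def flag_outliers(effects, z_cut=3.5):
    """Indices whose effect is far from the median, using a median/MAD robust z-score."""
    med = statistics.median(effects)
    mad = statistics.median(abs(e - med) for e in effects)
    if mad == 0:
        return []  # no spread to compare against
    return [i for i, e in enumerate(effects) if 0.6745 * abs(e - med) / mad > z_cut]
```

In use, one would compute `degradation_effect` for every genome in a collection and pass the resulting list to `flag_outliers`; the flagged entries are candidates for closer inspection rather than confirmed quality problems.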

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Base Sequence
  • Databases, Factual
  • Genome*

Grants and funding

All authors were supported by National Institutes of Health grant 5R01AI100947-06, "Algorithms and Software for the Assembly of Metagenomic Data," to the University of Maryland College Park (Mihai Pop, PI), via a subaward to Fraunhofer USA. The sponsor URL is www.nih.gov. The sponsor played no role in the research, decision to publish, or preparation of the manuscript.