A Hierarchical Error Correction Strategy for Text DNA Storage

Interdiscip Sci. 2022 Mar;14(1):141-150. doi: 10.1007/s12539-021-00476-x. Epub 2021 Aug 31.

Abstract

DNA storage has been a thriving interdisciplinary research area because of its high density, low maintenance cost, and long durability for information storage. However, the complexity of errors in DNA sequences, including substitutions, insertions, and deletions, hinders its application to massive data storage. Motivated by the divide-and-conquer algorithm, we propose a hierarchical error correction strategy for text DNA storage. The basic idea is to design robust codes for common characters that can correct one-base errors, including insertions and/or deletions. Errors are corrected gradually, first by the codes in DNA reads, then by multiple alignment of character lines, and finally by word spelling. On the one hand, the proposed encoding method provides a systematic way to design storage-friendly codes, with properties such as 50% GC content, homopolymers of no more than 2 bases, and robustness against secondary structures. On the other hand, the proposed error correction method not only corrects a single insertion or deletion, but also deals with multiple insertions or deletions. Simulation results demonstrate that the proposed method can correct more than 98% of errors when the error rate is less than or equal to 0.05. Thus, it is more powerful and adaptable to complicated DNA storage applications.
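The storage-friendly constraints named in the abstract (50% GC content and homopolymers of no more than 2 bases) can be illustrated with a minimal Python sketch. This is not the authors' code design procedure: the function names, the exhaustive enumeration, and the tolerance parameter are assumptions made for illustration, and the secondary-structure criterion is not checked here.

```python
from itertools import groupby, product


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return max(len(list(run)) for _, run in groupby(seq))


def is_storage_friendly(seq: str, gc_target: float = 0.5,
                        max_run: int = 2, tol: float = 1e-9) -> bool:
    """Check the two constraints named in the abstract (hypothetical helper).

    Secondary-structure robustness is NOT evaluated in this sketch.
    """
    return (abs(gc_content(seq) - gc_target) <= tol
            and max_homopolymer_run(seq) <= max_run)


def candidate_codewords(length: int) -> list[str]:
    """Enumerate all sequences of a given length that pass both checks.

    Brute force over 4**length sequences, so only practical for short codes.
    """
    return [
        "".join(bases)
        for bases in product("ACGT", repeat=length)
        if is_storage_friendly("".join(bases))
    ]
```

A filter of this kind only narrows the pool of candidate codewords. Giving each character code one-base error correction ability, as the abstract describes, would additionally require selecting codewords that remain distinguishable after a single substitution, insertion, or deletion.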

Keywords: DNA storage; Deletion; Insertion; Robust codes; Substitution.

MeSH terms

  • Algorithms*
  • Base Sequence
  • Computer Simulation
  • DNA* / chemistry
  • Sequence Analysis, DNA / methods

Substances

  • DNA