Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Bioinformatics. 2019 Oct 1;35(19):3826-3828. doi: 10.1093/bioinformatics/btz144.

Abstract

Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF)-a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. Nucleotide Archival Format compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with DNA compressors: DELIMINATE and MFCompress, and with general purpose compressors: gzip, bzip2, xz, brotli and zstd.

Availability and implementation: NAF compressor and decompressor, as well as format specification are available at https://github.com/KirillKryukov/naf. Format specification is in public domain. Compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Algorithms
  • Base Sequence
  • Data Compression*
  • High-Throughput Nucleotide Sequencing
  • Nucleotides
  • Sequence Analysis, DNA
  • Software*

Substances

  • Nucleotides