Prefix-free parsing for building big BWTs

Christina Boucher; Travis Gagie; Alan Kuhnle; Ben Langmead; Giovanni Manzini; Taher Mun

doi:10.1186/s13015-019-0148-5

Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.

Authors

Christina Boucher¹, Travis Gagie²˒³, Alan Kuhnle¹˒⁴, Ben Langmead⁵, Giovanni Manzini⁶˒⁷, Taher Mun⁵

Affiliations

¹ CISE, University of Florida, Gainesville, FL, USA.
² EIT, Diego Portales University, Santiago, Chile.
³ CeBiB, Santiago, Chile.
⁴ Informatics Institute, Gainesville, FL, USA.
⁵ Johns Hopkins University, Baltimore, MD, USA.
⁶ University of Eastern Piedmont, Alessandria, Italy.
⁷ IIT, CNR, Pisa, Italy.

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P in O(|T|) time using workspace proportional to their total size. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73-GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
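The abstract describes the parsing step only at a high level, so the following is a minimal Python sketch of how a single pass over T might produce D and P. It is not the authors' implementation. The window length `w`, the modulus `p`, the sentinel characters, and the plain polynomial hash are all assumptions made for illustration. A practical tool would presumably use a rolling hash so that each window costs constant time, and the sketch stops before the step that builds the BWT from D and P.

```python
def window_hash(window, base=256, modulus=(1 << 61) - 1):
    """Deterministic polynomial hash of a short string (illustrative choice)."""
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % modulus
    return h


def prefix_free_parse(text, w=2, p=3):
    """Toy prefix-free parse of `text`; returns (dictionary, parse).

    Assumptions (not taken from the abstract): w >= 1, and the sentinels
    '\\x02' (start) and '\\x01' (end) do not occur in `text`.
    """
    start_mark, end_mark = "\x02", "\x01"
    assert w >= 1 and start_mark not in text and end_mark not in text
    s = start_mark + text + end_mark * w

    phrases = []
    start = 0
    # s[end - w:end] is the current length-w window; the window at
    # position 0 is skipped so the first phrase is longer than w.
    for end in range(w + 1, len(s) + 1):
        is_last = end == len(s)
        if is_last or window_hash(s[end - w:end]) % p == 0:
            phrases.append(s[start:end])
            # Consecutive phrases overlap by exactly w characters.
            start = end - w

    dictionary = sorted(set(phrases))          # D: distinct phrases, sorted
    rank = {ph: i for i, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]       # P: phrase ranks in text order
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Invert the parse: drop the w-character overlap, then the sentinels."""
    s = dictionary[parse[0]]
    for r in parse[1:]:
        s += dictionary[r][w:]
    return s[1:-w]
```

For example, `d, ps = prefix_free_parse("GATTACAT!GATTACAT!GATTAGATA" * 3)` followed by `reconstruct(d, ps, 2)` returns the original string. On repetitive input the same phrases recur, so the dictionary stays small while the parse records only their ranks, which is the effect the abstract relies on when it says D and P are much smaller than T.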

Keywords: Burrows-Wheeler Transform; Compression-aware algorithms; Genomic databases; Prefix-free parsing.

Grants and funding

R01 AI141810/AI/NIAID NIH HHS/United States