Optimal compressed representation of high throughput sequence data via light assembly

Antonio A Ginart; Joseph Hui; Kaiyuan Zhu; Ibrahim Numanagić; Thomas A Courtade; S Cenk Sahinalp; David N Tse

doi:10.1038/s41467-017-02480-6

Nat Commun. 2018 Feb 8;9(1):566. doi: 10.1038/s41467-017-02480-6.

Authors

Antonio A Ginart¹, Joseph Hui², Kaiyuan Zhu³, Ibrahim Numanagić⁴, Thomas A Courtade⁵, S Cenk Sahinalp⁶, David N Tse¹

Affiliations

¹ Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA.
² Department of Electrical Engineering & Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
³ Department of Computer Science, Indiana University Bloomington, Bloomington, IN, 47405, USA. kzhu@indiana.edu.
⁴ Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
⁵ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, 94720, USA.
⁶ Department of Computer Science, Indiana University Bloomington, Bloomington, IN, 47405, USA. cenksahi@indiana.edu.

Abstract

The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method closely approximates. Our method significantly improves on the compression performance of alternatives without compromising speed.
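
The idea of storing each read as a node that hangs off another, overlapping read can be illustrated with a toy sketch. This is not the authors' algorithm or their optimal construction: the greedy parent choice, the names `encode_reads` and `decode_reads`, and the `min_overlap` parameter are all assumptions made for illustration. Each read is stored as a parent index, a shift into the parent, and only the bases that the parent does not already supply.

```python
from typing import List, Optional, Tuple

# Hypothetical node layout: (parent index or None, shift into parent, novel suffix)
Node = Tuple[Optional[int], int, str]


def encode_reads(reads: List[str], min_overlap: int = 3) -> List[Node]:
    """Greedy toy encoder: attach each read to an earlier read it overlaps."""
    nodes: List[Node] = []
    for i, read in enumerate(reads):
        best: Node = (None, 0, read)  # default: a root that stores the whole read
        for j in range(i):
            parent = reads[j]
            # Try shifts in increasing order; the first hit is the longest overlap.
            for shift in range(0, len(parent) - min_overlap + 1):
                if read.startswith(parent[shift:]):
                    overlap = len(parent) - shift
                    if len(read) - overlap < len(best[2]):
                        best = (j, shift, read[overlap:])
                    break
        nodes.append(best)
    return nodes


def decode_reads(nodes: List[Node]) -> List[str]:
    """Rebuild reads in order; a parent always precedes its children."""
    reads: List[str] = []
    for parent, shift, suffix in nodes:
        if parent is None:
            reads.append(suffix)
        else:
            reads.append(reads[parent][shift:] + suffix)
    return reads


example_reads = ["ACGTAC", "GTACGG", "ACGGTC"]
example_nodes = encode_reads(example_reads)
```

In this example only the first read is stored in full; the other two keep a two-base suffix each, and `decode_reads(example_nodes)` recovers the original list. A real compressor would also have to handle sequencing errors, reverse complements, and an entropy coder for the node fields, none of which this sketch attempts.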

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology / methods*
Genome / genetics
Genomics / methods*
High-Throughput Nucleotide Sequencing / methods*
Reproducibility of Results
Software

Grants and funding

R01 GM108348/GM/NIGMS NIH HHS/United States