Fast search of thousands of short-read sequencing experiments

Brad Solomon; Carl Kingsford

doi:10.1038/nbt.3442

Nat Biotechnol. 2016 Mar;34(3):300-2. doi: 10.1038/nbt.3442. Epub 2016 Feb 8.

Authors

Brad Solomon¹, Carl Kingsford²

Affiliations

¹ Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
² Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

Abstract

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
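
The abstract does not spell out how an SBT is organized, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binary tree in which each leaf holds a Bloom filter of the k-mers from one experiment and each internal node holds the union of its children's filters. A query descends only into subtrees whose filter contains at least a fraction `theta` of the query's k-mers. The names `BloomFilter`, `Node`, `build_leaf`, `build_tree` and `query`, the filter size, the hash scheme and the toy reads are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, size=4096, num_hashes=1):
        self.size = size              # number of bits (assumed value)
        self.num_hashes = num_hashes  # hash functions per item (assumed value)
        self.bits = 0

    def _positions(self, item):
        # Derive bit positions from seeded SHA-256 digests (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Node:
    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom = bloom
        self.name = name    # experiment name for leaves, None for internal nodes
        self.left = left
        self.right = right


def build_leaf(name, reads, k):
    """Build a leaf whose filter holds every k-mer seen in one experiment."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(bloom, name=name)


def build_tree(leaves):
    """Pair nodes level by level; each parent stores the union of its children."""
    level = list(leaves)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                left, right = level[i], level[i + 1]
                parents.append(Node(left.bloom.union(right.bloom),
                                    left=left, right=right))
            else:
                parents.append(level[i])  # odd node is carried up unchanged
        level = parents
    return level[0]


def query(node, query_kmers, theta):
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can reach the threshold
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)


# Toy data: three made-up "experiments", each with a single short read.
experiments = {
    "exp_a": ["ACGTACGTAC"],
    "exp_b": ["TTTTGGGGCC"],
    "exp_c": ["ACGTACGTTT"],
}
K = 4
root = build_tree([build_leaf(name, reads, K) for name, reads in experiments.items()])
matches = query(root, kmers("ACGTACGT", K), theta=1.0)
print(matches)
```

Because Bloom filters can report false positives but never false negatives, a search of this kind may return extra experiments but should not miss one that truly contains the required fraction of k-mers. Pruning at internal nodes is what lets a query skip most of the collection instead of scanning every experiment.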

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Data Mining*
High-Throughput Nucleotide Sequencing / methods*
Humans
RNA / genetics*
Sequence Analysis, DNA / methods
Sequence Analysis, RNA / methods*

Substances

RNA
