SeQuiLa: an elastic, fast and scalable SQL-oriented solution for processing and querying genomic intervals

Bioinformatics. 2019 Jun 1;35(12):2156-2158. doi: 10.1093/bioinformatics/bty940.

Abstract

Summary: Efficient processing of large-scale genomic datasets has recently become possible due to the application of 'big data' technologies in bioinformatics pipelines. We present SeQuiLa-a distributed, ANSI SQL-compliant solution for speedy querying and processing of genomic intervals that is available as an Apache Spark package. Proposed range join strategy is significantly (∼22×) faster than the default Apache Spark implementation and outperforms other state-of-the-art tools for genomic intervals processing.

Availability and implementation: The project is available at http://biodatageeks.org/sequila/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genome
  • Genomics*
  • Software*