Data Management for Heterogeneous Genomic Datasets

IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1251-1264. doi: 10.1109/TCBB.2016.2576447. Epub 2016 Jun 7.

Abstract

Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this context, we recently proposed a new paradigm to raise the level of abstraction in NGS data management by introducing the GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Building on that effort, here we motivate and formalize GMQL operations, focusing especially on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation, illustrate the architecture of the new software system that we have developed to execute them on big genomic data in a cloud computing environment, and evaluate its performance. The new system implementation is available for download at the GMQL website (http://www.bioinformatics.deib.polimi.it/GMQL/); GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cloud Computing
  • Database Management Systems*
  • Databases, Genetic*
  • Genomics*
  • Sequence Analysis, DNA