Bioinformatics applications on Apache Spark

Runxin Guo; Yi Zhao; Quan Zou; Xiaodong Fang; Shaoliang Peng

doi:10.1093/gigascience/giy098

Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098.

Authors

Runxin Guo¹, Yi Zhao², Quan Zou³, Xiaodong Fang⁴, Shaoliang Peng¹,⁵

Affiliations

¹ College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China.
² Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China.
³ School of Computer Science and Technology, No.135, Yaguan Road, Jinnan District, Tianjin University, Tianjin, 300050, China.
⁴ BGI Genomics, BGI-Shenzhen, No.21, Mingzhu Road, Yantian District, Shenzhen, 518083, China.
⁵ College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, No.252, Shannan Road, Yuelu District, Changsha, 410082, China.

Abstract

With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop when data are processed in memory and 10 times faster when data are processed on disk. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
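
As an illustration of the data-parallel style that the resilient distributed dataset abstraction exposes, the following is a minimal sketch in plain Python of k-mer counting over sequencing reads. It does not use Spark and is not taken from the surveyed applications; the function names (`kmers`, `count_partition`, `count_kmers`), the sample reads, and the partitioning scheme are all assumptions made for illustration. In Spark, the same three steps would roughly correspond to the RDD operations `flatMap`, `map`, and `reduceByKey`.

```python
from collections import Counter
from functools import reduce


def kmers(read, k):
    """Yield every overlapping substring of length k in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_partition(reads, k):
    """Count k-mers within a single partition (the per-worker step)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def count_kmers(reads, k, num_partitions=2):
    """Count k-mers across all reads with a partition-then-merge pattern."""
    # Split the reads into partitions, as a cluster framework would
    # distribute them across workers (hypothetical round-robin scheme).
    partitions = [reads[i::num_partitions] for i in range(num_partitions)]
    # Map step: each partition is counted independently of the others.
    partial_counts = [count_partition(p, k) for p in partitions]
    # Reduce step: merge the partial counts by summing values per k-mer.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# Hypothetical toy input: three short reads.
reads = ["ACGTAC", "GTACGT", "TTACGA"]
counts = count_kmers(reads, 3)
```

Because each partition is counted independently and the merge is a per-key sum, the result does not depend on how the reads are partitioned, which is the property that lets a framework such as Spark distribute the work and recompute a lost partition on failure.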

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Animals
Computational Biology / instrumentation
Computational Biology / methods
Genomics / instrumentation*
Genomics / methods
High-Throughput Nucleotide Sequencing / instrumentation*
High-Throughput Nucleotide Sequencing / methods
Humans
Mice
Software