MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes

Wei Zhou; Ruilin Li; Shuo Yuan; ChangChun Liu; Shaowen Yao; Jing Luo; Beifang Niu

doi:10.1093/bioinformatics/btw750

Bioinformatics. 2017 Apr 1;33(7):1090-1092. doi: 10.1093/bioinformatics/btw750.

Authors

Wei Zhou¹, Ruilin Li²,³, Shuo Yuan¹, ChangChun Liu¹, Shaowen Yao¹, Jing Luo⁴, Beifang Niu²,³

Affiliations

¹ School of Software, Yunnan University, Kunming, China.
² Computer Network Information Center of Chinese Academy of Sciences, Beijing, China.
³ University of Chinese Academy of Sciences, Beijing 100190, China.
⁴ School of Life Sciences and State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University, Kunming, China.

PMID: 28065898

Abstract

Summary: With the advent of next-generation sequencing, traditional bioinformatics tools are challenged by massive raw metagenomic datasets. One bottleneck in metagenomic studies is the lack of data analysis tools suited to large-scale and cloud computing. In this paper, we propose a Spark-based tool, called MetaSpark, to recruit metagenomic reads to reference genomes. MetaSpark builds on Spark's resilient distributed dataset (RDD), which lets it cache data in memory across cluster nodes and scale well with dataset size. MetaSpark recruited significantly more reads than many programs such as SOAP2, BWA and LAST, and recruited ∼4% more reads than FR-HIT in a test case with 1 million reads and 0.75 GB of references. Different test cases demonstrate MetaSpark's scalability and overall high performance.
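To make the idea of read recruitment concrete, the following is a minimal sketch in plain Python, not MetaSpark's actual implementation. It assumes a simple seed-and-extend scheme: index the k-mers of a reference, use shared k-mers to propose ungapped placements for each read, and keep a read only if its best placement reaches an identity threshold. The function names (`build_kmer_index`, `recruit_read`, `recruit_reads`) and the parameters `k` and `min_identity` are hypothetical and chosen for illustration; the abstract does not specify the algorithm or its settings. Spark is deliberately left out so the sketch runs with the standard library alone.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def recruit_read(read, reference, index, k, min_identity):
    """Return (ref_start, identity) for the best ungapped placement, or None.

    Assumption: placements are ungapped and scored by the fraction of
    matching bases; a real recruiter would also handle gaps and both strands.
    """
    best = None
    tried_starts = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in tried_starts:
                continue
            tried_starts.add(start)
            window = reference[start:start + len(read)]
            matches = sum(a == b for a, b in zip(read, window))
            identity = matches / len(read)
            if identity >= min_identity and (best is None or identity > best[1]):
                best = (start, identity)
    return best


def recruit_reads(reads, reference, k=4, min_identity=0.8):
    """Recruit each named read to the reference; unrecruited reads are omitted."""
    index = build_kmer_index(reference, k)
    hits = {}
    for name, seq in reads.items():
        hit = recruit_read(seq, reference, index, k, min_identity)
        if hit is not None:
            hits[name] = hit
    return hits
```

In a Spark setting, one plausible arrangement (again an assumption, not a description of MetaSpark) would be to hold the reads in an RDD, share the reference index with the workers, and apply a function like `recruit_read` to each partition, so that cached data stays in memory across cluster nodes.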

Availability: https://github.com/zhouweiyg/metaspark.

Contact: bniu@sccas.cn, jingluo@ynu.edu.cn.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Evaluation Study

MeSH terms

Algorithms
Genome
High-Throughput Nucleotide Sequencing / methods*
High-Throughput Nucleotide Sequencing / standards
Humans
Metagenomics / methods*
Metagenomics / standards
Reference Standards
Software*