SparkGC: Spark based genome compression for large collections of genomes

Haichang Yao; Guangyong Hu; Shangdong Liu; Houzhi Fang; Yimu Ji

doi:10.1186/s12859-022-04825-5

BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.

Authors

Haichang Yao¹, Guangyong Hu¹, Shangdong Liu², Houzhi Fang², Yimu Ji³ ⁴ ⁵

Affiliations

¹ School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China.
² School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.
³ School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China. jiym@njupt.edu.cn.
⁴ Jiangsu HPC and Intelligent Processing Engineer Research Center, Nanjing, 210003, China. jiym@njupt.edu.cn.
⁵ Institute of High Performance Computing and Bigdata, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China. jiym@njupt.edu.cn.

Abstract

Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node, and it scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.

Keywords: Distributed parallel; Genome compression; Reference-based compression; Spark.

MeSH terms

Algorithms*
Data Compression* / methods
Genome
High-Throughput Nucleotide Sequencing / methods
Humans
Sequence Analysis, DNA / methods
Software
