ParStream-seq: An improved method of handling next generation sequence data

Genomics. 2019 Dec;111(6):1641-1650. doi: 10.1016/j.ygeno.2018.11.014. Epub 2018 Nov 15.

Abstract

The exponential growth of next generation sequencing (NGS) data poses challenges for both its storage and its fast, efficient analysis. Storing the entire data set for a particular experiment and aligning it to the reference genome are essential steps in any quantitative analysis of NGS data. Here, we introduce a streaming access technique, 'ParStream-seq', that splits bulk sequence data accessed from a remote repository into short, manageable packets and then runs their alignment in parallel across the compute cores. The optimal packet size, expressed as a fixed number of reads, is determined within the stream so as to maximize system utilization. Results show a reduction in execution time and an improvement in memory footprint. Overall, this streaming access technique provides a means to overcome the hurdle of storing the entire volume of sequence data for a particular experiment prior to its analysis.
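
The packet-and-parallelize idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`fastq_reads`, `packets`, `align_packet`, `stream_align`), the FASTQ input format, the `max_in_flight` limit, and the placeholder alignment step are all assumptions made for illustration. The sketch reads records from a stream, groups them into packets of a fixed number of reads, and hands each packet to a worker pool while keeping only a bounded number of packets in memory.

```python
import io
from collections import deque
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def fastq_reads(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from a stream of FASTQ lines (assumes 4 lines per record)."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(islice(stripped, 4))
        if len(record) < 4:
            return  # end of stream, or a truncated trailing record
        header, seq, _plus, qual = record
        yield header, seq, qual


def packets(items: Iterable, reads_per_packet: int) -> Iterator[list]:
    """Group a stream into lists of at most reads_per_packet items."""
    it = iter(items)
    while True:
        packet = list(islice(it, reads_per_packet))
        if not packet:
            return
        yield packet


def align_packet(packet: List[Read]) -> Tuple[int, int]:
    """Hypothetical stand-in for an aligner call.

    A real pipeline would align the reads against a reference genome here;
    this placeholder only returns (number of reads, total bases).
    """
    return len(packet), sum(len(seq) for _, seq, _ in packet)


def stream_align(lines: Iterable[str], reads_per_packet: int,
                 executor: Executor, max_in_flight: int = 4) -> List[Tuple[int, int]]:
    """Submit packets to the pool as they arrive, in stream order.

    At most max_in_flight packets are held at once, so the full data set
    never has to be stored before processing starts.
    """
    pending = deque()
    results = []
    for packet in packets(fastq_reads(lines), reads_per_packet):
        pending.append(executor.submit(align_packet, packet))
        if len(pending) >= max_in_flight:
            results.append(pending.popleft().result())
    while pending:
        results.append(pending.popleft().result())
    return results


if __name__ == "__main__":
    # Tiny in-memory stream standing in for data read from a remote repository.
    demo = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(stream_align(io.StringIO(demo), reads_per_packet=2, executor=pool))
```

A thread pool is used here only to keep the sketch simple; for CPU-bound work done in Python itself, a `ProcessPoolExecutor` could be passed in instead. The value of `reads_per_packet` is the tunable the abstract refers to as the packet size, and in practice it would be chosen by measuring system utilization rather than fixed by hand.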

Keywords: Alignment; Biological big data; HDFS; NGS; Parallel computing; Streaming.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • High-Throughput Nucleotide Sequencing*
  • Sequence Alignment*
  • Sequence Analysis, DNA*
  • Software*