Apache Spark Implementations for String Patterns in DNA Sequences

Adv Exp Med Biol. 2020:1194:439-453. doi: 10.1007/978-3-030-32622-7_42.

Abstract

The volume of available numerical data grows remarkably from one day to the next. New high-throughput Next-Generation Sequencing (NGS) technologies, which have revolutionized genomic research, are producing DNA sequences. In this paper, we run experiments with a cloud infrastructure framework, namely Apache Spark, on sequences derived from the National Center for Biotechnology Information (NCBI). We examine three of the most popular string problems: Longest Common Prefix, Longest Common Substring, and Longest Common Subsequence.
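For readers unfamiliar with the three problems named above, the following is a minimal single-machine sketch in plain Python. It is not the paper's Apache Spark implementation, which the abstract does not describe; the function names and the short example sequences are hypothetical and chosen only for illustration. The substring and subsequence functions use the standard dynamic-programming recurrences.

```python
from typing import List


def longest_common_prefix(seqs: List[str]) -> str:
    """Return the longest prefix shared by every sequence in seqs."""
    if not seqs:
        return ""
    shortest = min(seqs, key=len)
    for i, ch in enumerate(shortest):
        # Stop at the first position where any sequence disagrees.
        if any(s[i] != ch for s in seqs):
            return shortest[:i]
    return shortest


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous run of characters found in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def longest_common_subsequence_length(a: str, b: str) -> int:
    """Return the length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    # Hypothetical toy sequences, not NCBI data.
    print(longest_common_prefix(["ACGTAC", "ACGTTT", "ACGA"]))
    print(longest_common_substring("GATTACA", "TTACGA"))
    print(longest_common_subsequence_length("AGGTAB", "GXTXAYB"))
```

The two pairwise functions take time proportional to the product of the sequence lengths, which is one reason long genomic inputs motivate a distributed framework.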

Keywords: DNA sequencing; Longest Common Prefix (LCP); Longest Common Subsequence (LCS); Longest Common Substring.

MeSH terms

  • Algorithms
  • Base Sequence
  • Cloud Computing
  • Genome / genetics
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing*
  • Sequence Analysis, DNA* / methods
  • Software* / standards