Apache Spark Implementations for String Patterns in DNA Sequences

Andreas Kanavos; Ioannis Livieris; Phivos Mylonas; Spyros Sioutas; Gerasimos Vonitsanos

doi:10.1007/978-3-030-32622-7_42

Adv Exp Med Biol. 2020;1194:439-453. doi: 10.1007/978-3-030-32622-7_42.

Authors

Andreas Kanavos¹, Ioannis Livieris², Phivos Mylonas³, Spyros Sioutas⁴, Gerasimos Vonitsanos⁴

Affiliations

¹ Computer Engineering and Informatics Department, University of Patras, Patras, Greece. kanavos@ceid.upatras.gr.
² Department of Mathematics, University of Patras, Patras, Greece. livieris@gmail.com.
³ Department of Informatics, Ionian University, Corfu, Greece.
⁴ Computer Engineering and Informatics Department, University of Patras, Patras, Greece.

PMID: 32468560
DOI: 10.1007/978-3-030-32622-7_42

Abstract

The availability of numerical data grows from one day to the next at a remarkable rate. New high-throughput Next-Generation Sequencing (NGS) technologies, which have revolutionized genomic research, are producing DNA sequences. In this paper, we run experiments using a cloud infrastructure framework, namely Apache Spark, on sequences obtained from the National Center for Biotechnology Information (NCBI). The problems we examine are among the most popular string-pattern problems: Longest Common Prefix, Longest Common Substring, and Longest Common Subsequence.
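
For readers unfamiliar with the three problems named in the abstract, they can be illustrated with a minimal single-machine sketch in Python. This is not the authors' Apache Spark implementation, which the abstract does not describe. It uses textbook dynamic programming on two short, made-up DNA strings, and all function names are hypothetical.

```python
def longest_common_prefix(a: str, b: str) -> str:
    """Return the longest string that both a and b start with."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def longest_common_subsequence_length(a: str, b: str) -> int:
    """Return the length of the longest (not necessarily contiguous)
    sequence of characters appearing in the same order in a and b."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    # Illustrative inputs only; these are not sequences from NCBI.
    print(longest_common_prefix("ACGTAC", "ACGGTC"))
    print(longest_common_substring("GATTACA", "TTACGA"))
    print(longest_common_subsequence_length("ACGT", "AGT"))
```

Both dynamic-programming routines take time proportional to the product of the two sequence lengths, which is one reason a distributed framework is attractive for long genomic inputs.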

Keywords: DNA sequencing; Longest Common Prefix (LCP); Longest Common Subsequence (LCS); Longest Common Substring.

MeSH terms

Algorithms
Base Sequence
Cloud Computing
Genome / genetics
Genomics / methods
High-Throughput Nucleotide Sequencing*
Sequence Analysis, DNA* / methods
Software* / standards