High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

Anas Oujja; Mohamed Riduan Abid; Jaouad Boumhidi; Safae Bourhnane; Asmaa Mourhir; Fatima Merchant; Driss Benhaddou

doi:10.5808/gi.21056

Genomics Inform. 2021 Dec;19(4):e49. doi: 10.5808/gi.21056. Epub 2021 Dec 31.

Authors

Anas Oujja¹,², Mohamed Riduan Abid¹, Jaouad Boumhidi², Safae Bourhnane¹,³, Asmaa Mourhir¹, Fatima Merchant⁴, Driss Benhaddou⁴

Affiliations

¹ School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.
² Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco.
³ Faculty of Sciences, Chouaib Doukkali University, El Jadida 24000, Morocco.
⁴ Computer Engineering Technology Faculty, University of Houston, Houston, TX 77204, USA.

Abstract

Genomic data constitutes one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which mandates adequate high-performance computing (HPC) platforms for processing. With the recent unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering), thus assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. These sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.

Keywords: RNA; SARS-CoV-2; bioinformatics; data science; high-performance computing; longest common subsequence.
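
To make the analysis step described in the abstract more concrete, the following is a minimal single-machine sketch in Python of computing the LCS length for each pair of sequences and turning it into a dissimilarity score. It is not the authors' Hadoop MapReduce implementation: the function names (`lcs_length`, `dissimilarity`, `pairwise_dissimilarities`) are hypothetical, a plain loop over pairs stands in for the distributed map step, and the normalization used here (one minus the LCS length divided by the length of the longer sequence) is an assumption, since the abstract does not state the exact dissimilarity formula.

```python
from itertools import combinations


def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only two rows of the table.
    """
    if len(b) > len(a):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a, b):
    """ASSUMED measure: 1 - LCS length / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(sequences):
    """Map each unordered pair of sequence IDs to its dissimilarity.

    `sequences` is a dict of {sequence_id: sequence_string}. In a
    MapReduce setting, each pair could be handled by a separate map task;
    here a single loop stands in for that step.
    """
    return {
        (id_a, id_b): dissimilarity(sequences[id_a], sequences[id_b])
        for id_a, id_b in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    # Toy fragments for illustration only, not real SARS-CoV-2 data.
    toy = {"s1": "AUGGCUA", "s2": "AUGCCUA", "s3": "GGGAAAC"}
    for pair, score in pairwise_dissimilarities(toy).items():
        print(pair, round(score, 3))
```

The resulting pairwise scores form a dissimilarity matrix that a clustering method can consume. The quadratic cost of the dynamic program per pair, multiplied by the number of pairs, is what motivates distributing the work across worker nodes.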