High-Performance Framework to Analyze Microarray Data

Methods Mol Biol. 2022:2401:13-27. doi: 10.1007/978-1-0716-1839-4_2.

Abstract

Pharmacogenomics is an important research field that studies how genetic variation among patients affects drug response, looking for correlations between single nucleotide polymorphisms (SNPs) in the patient genome and drug toxicity or efficacy. The large number of available samples and the high resolution of the instruments allow microarray platforms to produce huge amounts of SNP data. To analyze such data and find correlations in a reasonable time, high-performance computing solutions must be used. Cloud4SNP is a bioinformatics tool, based on the Data Mining Cloud Framework (DMCF), for parallel preprocessing and statistical analysis of SNP pharmacogenomics microarray data. This work describes how Cloud4SNP has been extended to execute applications on Apache Spark, which provides faster execution times for iterative and batch processing. The experimental evaluation shows that Cloud4SNP is able to exploit the high-performance features of Apache Spark, obtaining faster execution times and a high level of scalability, with an overall speedup that is very close to linear.
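To make the notion of a speedup "very close to linear" concrete, the following is a minimal sketch of how speedup and parallel efficiency are conventionally computed from measured run times. The function names and the timing values are hypothetical and chosen only for illustration; they are not results from the paper and do not reflect the Cloud4SNP or Spark APIs.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup = run time on one worker / run time on p workers."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Parallel efficiency = speedup / number of workers (1.0 is ideal)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_serial, t_parallel) / workers


# Hypothetical run times in seconds, keyed by number of workers.
# These values are illustrative assumptions, not measurements.
timings = {1: 1200.0, 2: 610.0, 4: 310.0, 8: 160.0}

t1 = timings[1]
report = {
    p: (speedup(t1, t), efficiency(t1, t, p))
    for p, t in timings.items()
}

for p, (s, e) in sorted(report.items()):
    print(f"workers={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

A linear speedup corresponds to a speedup equal to the number of workers, that is, an efficiency of 1.0; values slightly below 1.0 at every scale would be described as close to linear.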

Keywords: Cloud computing; Pharmacogenomics; Single nucleotide polymorphisms; Statistical analysis.

MeSH terms

  • Algorithms
  • Computational Biology*
  • Computing Methodologies
  • Genome
  • Humans
  • Microarray Analysis*
  • Software