Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism

J Biotechnol. 2017 Sep 10;257:58-60. doi: 10.1016/j.jbiotec.2017.02.020. Epub 2017 Feb 21.

Abstract

The introduction of next-generation sequencing has caused a steady increase in the amount of data that has to be processed in the modern life sciences. Sequence alignment plays a key role in the analysis of sequencing data, e.g., in whole genome sequencing or metagenome projects. BLAST has been the standard alignment tool for more than two decades, but in recent years faster alternatives have been proposed, including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale out the calculation of alignments. HAMOND is fault tolerant and scales on large cloud computing infrastructures such as Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results in both efficiency and accuracy.
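
The abstract does not describe how HAMOND distributes work, so the following is only a minimal sketch of the general scale-out idea, assuming the query sequences are split into independent chunks that separate workers could align against the same reference. The function names `parse_fasta` and `split_queries` are hypothetical and are not taken from HAMOND, DIAMOND, or Hadoop.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    seq_lines: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq_lines)
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def split_queries(text: str, n_chunks: int) -> List[List[Record]]:
    """Distribute query records round-robin over at most n_chunks chunks.

    Each chunk can be aligned independently, which is what makes the
    workload easy to spread over many workers (an assumption about the
    general approach, not a description of HAMOND's internals).
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(parse_fasta(text)):
        chunks[i % n_chunks].append(record)
    # Drop empty chunks when there are fewer records than workers.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    queries = ">a\nMKV\nLLA\n>b\nMTT\n>c\nMAA\n"
    for worker_id, chunk in enumerate(split_queries(queries, 2)):
        print(worker_id, [header for header, _ in chunk])
```

Because each query is aligned independently of the others, a failed chunk could in principle be rerun on its own without repeating the rest of the job, which is the kind of property a framework such as Hadoop relies on for fault tolerance.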

Keywords: Cloud computing; Parallel computing; Sequence alignment.

MeSH terms

  • Cloud Computing*
  • Comparative Genomic Hybridization
  • Computational Biology
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / instrumentation
  • High-Throughput Nucleotide Sequencing / methods*
  • Internet
  • Metagenome
  • Molecular Sequence Data
  • Sequence Alignment / instrumentation
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA / instrumentation
  • Sequence Analysis, DNA / methods*
  • Software
  • Whole Genome Sequencing