A Massively Parallel Sequence Similarity Search for Metagenomic Sequencing Data

Masanori Kakuta; Shuji Suzuki; Kazuki Izawa; Takashi Ishida; Yutaka Akiyama

doi:10.3390/ijms18102124

Int J Mol Sci. 2017 Oct 11;18(10):2124. doi: 10.3390/ijms18102124.

Authors

Masanori Kakuta¹, Shuji Suzuki²,³, Kazuki Izawa⁴, Takashi Ishida⁵,⁶,⁷, Yutaka Akiyama⁸,⁹,¹⁰

Affiliations

¹ Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. kakuta@bi.cs.titech.ac.jp.
² Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. suzuki@bi.cs.titech.ac.jp.
³ Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, 4259 J3-141 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8503, Japan. suzuki@bi.cs.titech.ac.jp.
⁴ Department of Computer Science, School of Computing, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. izawa@bi.c.titech.ac.jp.
⁵ Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. ishida@c.titech.ac.jp.
⁶ Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, 4259 J3-141 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8503, Japan. ishida@c.titech.ac.jp.
⁷ Department of Computer Science, School of Computing, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. ishida@c.titech.ac.jp.
⁸ Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. akiyama@c.titech.ac.jp.
⁹ Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, 4259 J3-141 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8503, Japan. akiyama@c.titech.ac.jp.
¹⁰ Department of Computer Science, School of Computing, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. akiyama@c.titech.ac.jp.

Abstract

Sequence similarity searches have been widely used in the analysis of metagenomic sequencing data. Finding homologous sequences in a reference database enables the estimation of taxonomic and functional characteristics of each query sequence. Because current metagenomic sequencing data consist of a large number of nucleotide sequences, the time required for sequence similarity searches accounts for a large proportion of the total analysis time. This time-consuming step makes it difficult to perform large-scale analyses. To analyze large-scale metagenomic data, such as those found in the human oral microbiome, we developed GHOST-MP (Genome-wide HOmology Search Tool on Massively Parallel system), a parallel sequence similarity search tool for massively parallel computing systems. This tool uses a fast search algorithm based on suffix arrays of query and database sequences and a hierarchical parallel search to accelerate the large-scale sequence similarity search of metagenomic sequencing data. The parallel computing efficiency and the search speed of this tool were evaluated. GHOST-MP was shown to scale to more than 10,000 CPU (Central Processing Unit) cores, and achieved over 80-fold acceleration compared with mpiBLAST using the same computational resources. We applied this tool to human oral metagenomic data, and the results indicate that the oral cavity, the oral vestibule, and plaque have different characteristics based on the functional gene category.
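
As a rough illustration of why suffix arrays help in a similarity search, the following Python sketch finds exact seed matches between a query and a database sequence by binary search over a suffix array of the database. It is a toy example under stated assumptions, not GHOST-MP's implementation: the function names (build_suffix_array, find_occurrences, seed_hits), the naive construction, and the fixed seed length k are hypothetical, and the actual tool builds suffix arrays of both query and database sequences and distributes the work hierarchically across many nodes.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting suffix strings; adequate for a toy example,
    far too slow for real databases.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all positions where pattern occurs in text, via binary search.

    Suffixes are sorted, so their length-m prefixes are sorted too, and every
    suffix that starts with pattern sits in one contiguous block of sa.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def seed_hits(query, database, k):
    """Return (query_pos, database_pos) pairs for every exact k-mer match."""
    sa = build_suffix_array(database)
    hits = []
    for q in range(len(query) - k + 1):
        for d in find_occurrences(database, sa, query[q:q + k]):
            hits.append((q, d))
    return hits


if __name__ == "__main__":
    database = "ACGTACGGACGT"
    print(find_occurrences(database, build_suffix_array(database), "ACG"))
    print(seed_hits("GACGT", database, 4))
```

Each lookup costs a number of string comparisons logarithmic in the database length, which is the property that makes an index of this kind attractive for seeding. A real search tool would then extend the seeds into scored alignments, a step this sketch omits.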

Keywords: database search; human oral microbiome; metagenomics; sequence similarity search.

MeSH terms

Algorithms
Humans
Metagenome / genetics*
Metagenomics / methods*
Microbiota / genetics*
Mouth / microbiology*
Sequence Analysis, DNA / methods*
Sequence Homology, Nucleic Acid*
Software*