CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

Jeongsu Oh; Chi-Hwan Choi; Min-Kyu Park; Byung Kwon Kim; Kyuin Hwang; Sang-Heon Lee; Soon Gyu Hong; Arshan Nasir; Wan-Sup Cho; Kyung Mo Kim

doi:10.1371/journal.pone.0151064

PLoS One. 2016 Mar 8;11(3):e0151064. doi: 10.1371/journal.pone.0151064. eCollection 2016.

Authors

Jeongsu Oh¹, Chi-Hwan Choi², Min-Kyu Park³, Byung Kwon Kim⁴, Kyuin Hwang⁵, Sang-Heon Lee¹,⁶, Soon Gyu Hong⁵, Arshan Nasir⁷, Wan-Sup Cho⁸, Kyung Mo Kim¹,⁶

Affiliations

¹ Microbial Resource Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Republic of Korea.
² Department of Bio-Information Technology, Chungbuk National University, CheongJu, Republic of Korea.
³ Department of Business Data Convergence, Chungbuk National University, CheongJu, Republic of Korea.
⁴ BioNano Health Guard Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Republic of Korea.
⁵ Division of Polar Life Sciences, Korea Polar Research Institute, Incheon, Republic of Korea.
⁶ Department of Bioinformatics, University of Science and Technology, Daejeon, Republic of Korea.
⁷ Department of Biosciences, COMSATS Institute of Information Technology, Islamabad, Pakistan.
⁸ Department of Management Information Systems/BK Plus Team, Chungbuk National University, CheongJu, Republic of Korea.

Abstract

High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Typically, bioinformatic analysis of microbial diversity starts with pre-processing, followed by clustering the 16S rRNA reads into a relatively small number of operational taxonomic units (OTUs). OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology allows CLUSTOM-CLOUD to handle larger datasets and to scale computationally better than its ancestor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours using 20, 30, and 40 nodes, respectively. The running-time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm.
CLUSTOM-CLOUD is written in Java and is freely available at http://clustomcloud.kopri.re.kr.
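
The scalability claim can be checked with simple arithmetic on the Amazon EC2 timings quoted above. The sketch below is illustrative only and is not part of CLUSTOM-CLOUD: the function name `relative_scaling` and the choice of the 20-node run as baseline are assumptions made here, and the input hours are the approximate values from the abstract, so the resulting percentages are rough.

```python
# Wall-clock times quoted in the abstract for one million reads on Amazon EC2.
# These are approximate values ("approximately 20, 14, and 11 hours").
REPORTED_HOURS = {20: 20.0, 30: 14.0, 40: 11.0}  # nodes -> hours


def relative_scaling(hours_by_nodes, baseline_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / time at n nodes
    efficiency = speedup / (n / baseline_nodes); 1.0 means ideal linear scaling
    """
    base_time = hours_by_nodes[baseline_nodes]
    result = {}
    for nodes, hours in sorted(hours_by_nodes.items()):
        speedup = base_time / hours
        ideal = nodes / baseline_nodes
        result[nodes] = (speedup, speedup / ideal)
    return result


if __name__ == "__main__":
    # Hypothetical choice: treat the 20-node run as the baseline.
    for nodes, (speedup, eff) in relative_scaling(REPORTED_HOURS, 20).items():
        print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

With these inputs, doubling the node count from 20 to 40 gives a speedup of about 1.82x against an ideal of 2x, that is, roughly 91% parallel efficiency relative to the 20-node run.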

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis*
Computational Biology / methods
Environmental Microbiology*
Humans
RNA, Ribosomal, 16S / genetics*
Reproducibility of Results
Software*
Workflow

Substances

RNA, Ribosomal, 16S

Grants and funding

This material is based upon work supported by the KRIBB (http://www.kribb.re.kr) Research Initiative Program (to KMK) and by the KOPRI (http://www.kopri.re.kr) research program under Grant No. PE15020 (to SGH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.