ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers

Yuting Xing; Chengkun Wu; Xi Yang; Wei Wang; En Zhu; Jianping Yin

doi:10.3390/molecules23051028

Molecules. 2018 Apr 27;23(5):1028. doi: 10.3390/molecules23051028.

Authors

Yuting Xing¹, Chengkun Wu², Xi Yang³, Wei Wang⁴, En Zhu⁵, Jianping Yin⁶

Affiliations

¹ School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China. xingyuting16@nudt.edu.cn.
² School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China. chengkun_wu@nudt.edu.cn.
³ School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China. yangxi1016@nudt.edu.cn.
⁴ School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China. g.webywang@gmail.com.
⁵ School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China. enzhu@nudt.edu.cn.
⁶ School of Computer Science and Network Security, Dongguan University of Technology, Dongguan, Guangdong 523808, China. jpyin@dgut.edu.cn.

Abstract

A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods to unstructured texts. However, the massive volume of literature to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, using three types of named entity recognition (NER) tasks as a demonstration. The results show that, in most cases, parallel processing greatly improves processing efficiency, and that the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to biomedical text mining tasks other than NER.

Keywords: Tianhe-2; big data; biomedical text mining; load balancing; parallel computing.

MeSH terms

Algorithms
Biomedical Research
Data Mining / methods*
Electronic Data Processing / instrumentation*
Humans