An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

Meijing Li; Tianjie Chen; Keun Ho Ryu; Cheng Hao Jin

doi:10.1155/2021/7937573

Comput Math Methods Med. 2021 Nov 9:2021:7937573. doi: 10.1155/2021/7937573. eCollection 2021.

Authors

Meijing Li¹, Tianjie Chen¹, Keun Ho Ryu² ³ ⁴, Cheng Hao Jin⁵

Affiliations

¹ College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
² Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam.
³ Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand.
⁴ Department of Computer Science, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea.
⁵ ENN Research Institute of Digital Technology, Beijing 100096, China.

Abstract

Semantic mining remains a challenge for big biomedical text data. Ontologies have been widely used, and shown to be effective, for extracting semantic information. However, ontology-based semantic similarity calculation is so complex that it does not scale to big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, document clusters are generated from the resulting document similarities via clustering algorithms. To validate the method's effectiveness, we use two kinds of open datasets. The experimental results show that traditional methods can hardly handle more than ten thousand biomedical documents, whereas the proposed method remains efficient and accurate on big datasets and exhibits high parallelism and scalability.
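The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity formula, so the sketch assumes a simple path-based measure (1 / (1 + shortest path length) between concepts) and a best-match average between documents. The toy ontology, the concept names, and all function names are hypothetical, and the map and reduce stages are only mimicked in plain Python rather than run on Hadoop.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology, stored as undirected parent/child edges.
ONTOLOGY_EDGES = [
    ("disease", "neoplasm"),
    ("disease", "infection"),
    ("neoplasm", "carcinoma"),
    ("neoplasm", "sarcoma"),
    ("infection", "pneumonia"),
]


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, source, target):
    """Shortest number of edges between two concepts (BFS); None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def concept_similarity(graph, c1, c2):
    """Assumed path-based measure: closer concepts score nearer to 1."""
    d = path_length(graph, c1, c2)
    return 0.0 if d is None else 1.0 / (1.0 + d)


def document_similarity(graph, concepts_a, concepts_b):
    """Symmetric average of each concept's best match in the other document."""
    if not concepts_a or not concepts_b:
        return 0.0
    a_to_b = sum(max(concept_similarity(graph, a, b) for b in concepts_b)
                 for a in concepts_a)
    b_to_a = sum(max(concept_similarity(graph, a, b) for a in concepts_a)
                 for b in concepts_b)
    return (a_to_b + b_to_a) / (len(concepts_a) + len(concepts_b))


def map_phase(docs):
    """Emit one independent record per document pair, keyed by the pair of ids."""
    for id_a, id_b in combinations(sorted(docs), 2):
        yield (id_a, id_b), (docs[id_a], docs[id_b])


def reduce_phase(graph, records):
    """Compute one similarity per key; each record needs no other record."""
    return {key: document_similarity(graph, ca, cb) for key, (ca, cb) in records}


graph = build_graph(ONTOLOGY_EDGES)
# Hypothetical documents, already reduced to sets of ontology concepts.
docs = {"d1": {"carcinoma"}, "d2": {"sarcoma"}, "d3": {"pneumonia"}}
sims = reduce_phase(graph, map_phase(docs))
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

Because every document pair is scored independently, the pairs can be distributed across workers, which is the property a MapReduce formulation relies on. The resulting pairwise scores would then feed a clustering algorithm, which is omitted here.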

MeSH terms

Algorithms
Big Data*
Biological Ontologies / statistics & numerical data
Cluster Analysis*
Computational Biology
Data Mining / methods*
Data Mining / statistics & numerical data
Documentation / methods
Documentation / statistics & numerical data
Humans
MEDLINE / statistics & numerical data
Machine Learning
Semantics*