Research on named entity recognition of Traditional Chinese Medicine chest discomfort cases incorporating domain vocabulary features

Comput Biol Med. 2023 Sep 9:166:107466. doi: 10.1016/j.compbiomed.2023.107466. Online ahead of print.

Abstract

Objective: To promote research on knowledge extraction and knowledge graph construction of chest discomfort medical cases in Traditional Chinese Medicine (TCM), this paper focuses on their named entity recognition (NER). The recognition task includes six entity types: "syndrome", "symptom", "etiology and pathogenesis", "treatment method", "medication", and "prescription".

Methods: We annotated the data using the BIO (B-begin, I-inside, O-outside) scheme. To suit the characteristics of medical case texts, we proposed a word segmentation method based on a custom dictionary that can be dynamically updated. To assess the effect of the method on the experimental results, we applied it to the BiLSTM-CRF and IDCNN-CRF models, respectively.
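
As a rough illustration of the two ideas in the Methods paragraph, the following is a minimal sketch, not the authors' implementation. It assumes forward maximum matching as the segmentation strategy and character-level BIO labels; the abstract specifies neither. The names `CustomDictionary`, `segment`, and `to_bio`, and the sample terms, are hypothetical.

```python
class CustomDictionary:
    """Hypothetical in-memory domain dictionary that can grow at run time."""

    def __init__(self, terms=()):
        self.terms = set(terms)
        # Longest term length bounds the matching window in segment().
        self.max_len = max((len(t) for t in self.terms), default=1)

    def add(self, term):
        """Dynamically register a new domain term."""
        self.terms.add(term)
        self.max_len = max(self.max_len, len(term))


def segment(text, dictionary):
    """Forward maximum matching (an assumed strategy): at each position take
    the longest dictionary term, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(dictionary.max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary.terms:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_bio(tokens, entity_types):
    """Emit one (character, tag) pair per character.
    entity_types maps a token to its entity type; other tokens get "O"."""
    tagged = []
    for token in tokens:
        label = entity_types.get(token)
        for k, ch in enumerate(token):
            if label is None:
                tagged.append((ch, "O"))
            else:
                tagged.append((ch, ("B-" if k == 0 else "I-") + label))
    return tagged


if __name__ == "__main__":
    # Illustrative terms only: chest tightness and shortness of breath.
    dictionary = CustomDictionary({"胸闷", "气短"})
    text = "患者胸闷气短"
    print(segment(text, dictionary))
    dictionary.add("患者")  # dynamic update changes later segmentations
    print(segment(text, dictionary))
    print(to_bio(segment(text, dictionary), {"胸闷": "symptom", "气短": "symptom"}))
```

In this sketch, adding a term to the dictionary takes effect on the next call to `segment`, which is one simple way to read "dynamically updated"; the paper's actual update mechanism may differ.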

Results: The models using custom dictionaries (BiLSTM-CRF-Loaded and IDCNN-CRF-Loaded) outperformed the models without custom dictionaries (BiLSTM-CRF and IDCNN-CRF) in accuracy, precision, recall, and F1 score. The BiLSTM-CRF-Loaded model yielded F1 scores of 92.59% and 93.23% on the test set and validation set, respectively, surpassing the BERT-BiLSTM-CRF model by 3.59% and 4.87%. Furthermore, when analyzing the results for the six entity categories separately, we found that the use of custom dictionaries had a marked impact, with the "etiology and pathogenesis" and "syndrome" categories showing the most noticeable improvements. Among the six categories, "medication" achieved the highest F1 scores, reaching 96.04% and 96.48% on the test set and validation set, respectively.
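
For readers unfamiliar with how per-category precision, recall, and F1 are typically derived from BIO output, here is a minimal sketch. It assumes entity-level exact-match scoring, where a predicted entity counts as correct only if its boundaries and type both match the gold annotation; the abstract does not state which scoring convention was used. The function names and the toy tag sequences are hypothetical.

```python
from collections import Counter


def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def per_type_scores(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 for each entity type."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = Counter(t for _, _, t in gold & pred)      # spans matching exactly
    n_gold = Counter(t for _, _, t in gold)
    n_pred = Counter(t for _, _, t in pred)
    scores = {}
    for t in set(n_gold) | set(n_pred):
        p = tp[t] / n_pred[t] if n_pred[t] else 0.0
        r = tp[t] / n_gold[t] if n_gold[t] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
        scores[t] = (p, r, f1)
    return scores


if __name__ == "__main__":
    # Toy sequences: the predicted "medication" entity has a wrong boundary.
    gold = ["B-symptom", "I-symptom", "O", "B-medication", "I-medication"]
    pred = ["B-symptom", "I-symptom", "O", "B-medication", "O"]
    for entity_type, (p, r, f1) in sorted(per_type_scores(gold, pred).items()):
        print(entity_type, p, r, f1)
```

Under exact matching, a boundary error such as the truncated "medication" span in the toy data counts as both a missed entity and a spurious one, so it lowers recall and precision together.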

Conclusion: We propose a word segmentation method based on a dynamically updated custom dictionary. Combined with the BiLSTM-CRF and IDCNN-CRF models, the method improves the models' ability to recognize domain-specific terms and new entities. It can be widely applied to texts with complex structures and texts containing domain-specific terms.

Keywords: BERT-BiLSTM-CRF; BiLSTM-CRF; Chest discomfort; IDCNN-CRF; Named entity recognition; TCM medical cases.