A combined recall and rank framework with online negative sampling for Chinese procedure terminology normalization

Bioinformatics. 2021 Oct 25;37(20):3610-3617. doi: 10.1093/bioinformatics/btab381.

Abstract

Motivation: Medical terminology normalization aims to map a clinical mention to terminologies from a knowledge base, and plays an important role in analyzing electronic health records and in many downstream tasks. In this article, we focus on Chinese procedure terminology normalization. Terminology is expressed in varied ways, and one medical mention may be linked to multiple terminologies. Existing studies based on learning to rank do not fully consider the quality of negative samples during model training or the importance of keywords in this domain-specific task.

Results: We propose a combined recall and rank framework to solve these problems. A pair-wise BERT model with deep metric learning is used to recall candidates. Previous methods train BERT either in a point-wise way or as a multi-class classification problem, which may lead to serious efficiency problems or may not be effective enough. During model training, we design a novel online negative sampling algorithm to activate the pair-wise method. To deal with multi-implication scenarios, we train implication number prediction together with the recall task in a multi-task learning setting, since these two tasks are highly complementary. In the rank step, we propose a keyword-attentive mechanism to focus on domain-specific information such as procedure sites and procedure types. Finally, a fusion block merges the results of the recall and rank models. Detailed experimental analysis shows that our proposed framework achieves a remarkable improvement in both performance and efficiency.
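
The abstract does not spell out the sampling rule or the loss, so the following is only an illustrative sketch of one common way online negative sampling is paired with pair-wise metric learning: for each mention, the non-gold terminology that the current embeddings score highest is picked as the negative, and a margin-based triplet loss is computed. The cosine similarity, the margin value, and the names `hardest_negative` and `triplet_loss` are assumptions made for illustration, not details taken from the paper, and plain lists stand in for BERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hardest_negative(mention_vec, term_vecs, gold_ids):
    """Return the id of the non-gold term most similar to the mention.

    term_vecs maps a term id to its embedding; gold_ids is the set of
    correct term ids for this mention (a mention may have several).
    Returns None when every term is a gold term.
    """
    candidates = [tid for tid in term_vecs if tid not in gold_ids]
    if not candidates:
        return None
    return max(candidates, key=lambda tid: cosine(mention_vec, term_vecs[tid]))


def triplet_loss(mention_vec, pos_vec, neg_vec, margin=0.2):
    """Hinge loss that is zero once the positive beats the negative by `margin` (assumed value)."""
    return max(0.0, margin - cosine(mention_vec, pos_vec) + cosine(mention_vec, neg_vec))
```

Because the negative is re-selected from the current embeddings at every training step, the model keeps seeing the candidates it currently confuses with the gold terminology, rather than a fixed set of randomly drawn negatives.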

Availability and implementation: The source code will be available at https://github.com/sxthunder/CMTN upon publication.
