Clustering Similar Diagnosis Terms

Stefan Schulz; Akhila Abdulnazar; Markus Kreuzthaler

doi:10.3233/SHTI230284

Stud Health Technol Inform. 2023 May 18;302:837-838. doi: 10.3233/SHTI230284.

Authors

Stefan Schulz¹, Akhila Abdulnazar¹, Markus Kreuzthaler¹

Affiliation

¹ IMI, Medical University of Graz, Austria.

PMID: 37203513

Abstract

A large clinical diagnosis list is explored with the goal of clustering syntactic variants. A string similarity heuristic is compared with a deep learning-based approach. Levenshtein distance (LD) applied to common words only (not tolerating deviations in acronyms and tokens with numerals), together with pairwise substring expansions, raised F1 to 13% above the baseline (plain LD), with a maximum F1 of 0.71. In contrast, the model-based approach trained on a German medical language model did not perform better than the baseline, not exceeding an F1 value of 0.42.
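The string similarity heuristic can be illustrated with a minimal Python sketch. It is not the authors' implementation: the acronym test (`is_strict`, all upper case and at most five characters), the threshold `max_ld`, the token-by-token alignment, and the single-link grouping in `cluster` are assumptions made for illustration only. The substring expansion step is left out because the abstract does not describe it in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_strict(token: str) -> bool:
    """Assumed rule: tokens with digits and short all-caps tokens (acronyms)
    tolerate no deviation."""
    has_digit = any(ch.isdigit() for ch in token)
    looks_like_acronym = token.isupper() and len(token) <= 5
    return has_digit or looks_like_acronym


def tokens_match(t1: str, t2: str, max_ld: int = 1) -> bool:
    """Exact match for strict tokens, LD within max_ld for common words."""
    if is_strict(t1) or is_strict(t2):
        return t1 == t2
    return levenshtein(t1.lower(), t2.lower()) <= max_ld


def terms_similar(term1: str, term2: str, max_ld: int = 1) -> bool:
    """Compare two terms token by token (assumes equal token counts)."""
    toks1, toks2 = term1.split(), term2.split()
    if len(toks1) != len(toks2):
        return False
    return all(tokens_match(a, b, max_ld) for a, b in zip(toks1, toks2))


def cluster(terms: list[str], max_ld: int = 1) -> list[list[str]]:
    """Group terms connected by pairwise similarity (union-find, O(n^2) pairs)."""
    parent = list(range(len(terms)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if terms_similar(terms[i], terms[j], max_ld):
                parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    return list(groups.values())


# Hypothetical example terms, not taken from the study data.
terms = [
    "Diabetes mellitus Typ 2",
    "Diabetes melitus Typ 2",
    "Diabetes mellitus Typ 1",
    "COPD",
    "COLD",
]
clusters = cluster(terms)
```

In this example the misspelled "melitus" variant joins the cluster of "Diabetes mellitus Typ 2", whereas "Typ 1" and "Typ 2" stay apart, as do "COPD" and "COLD", even though plain LD would place each of these pairs at distance 1. The quadratic pairwise comparison is acceptable for a sketch; a large diagnosis list would likely need blocking or indexing.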

Keywords: Electronic Health Records; Named Entity Normalization.

MeSH terms

Cluster Analysis
Electronic Health Records
Language*
Natural Language Processing*
Records