Clustering Similar Diagnosis Terms

Stud Health Technol Inform. 2023 May 18;302:837-838. doi: 10.3233/SHTI230284.

Abstract

A large clinical diagnosis list is explored with the goal of clustering syntactic variants. A string similarity heuristic is compared with a deep learning-based approach. Levenshtein distance (LD) applied to common words only (not tolerating deviations in acronyms and tokens with numerals), together with pair-wise substring expansions, raised F1 by 13% over the baseline (plain LD), with a maximum F1 of 0.71. In contrast, the model-based approach, trained on a German medical language model, did not perform better than the baseline, not exceeding an F1 value of 0.42.
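The token-aware heuristic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no thresholds or tokenization rules, so the acronym test (all upper-case, at least two characters), the per-token edit distance limit `max_ld`, the position-wise token alignment, and all function names are assumptions made for illustration. The pair-wise substring expansion step is not covered.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_strict_token(token: str) -> bool:
    """Assumed rule: tokens with digits and all-upper-case acronyms must match exactly."""
    has_digit = any(c.isdigit() for c in token)
    is_acronym = len(token) >= 2 and token.isupper()
    return has_digit or is_acronym


def tokens_match(t1: str, t2: str, max_ld: int = 1) -> bool:
    """Tolerate small edit distances only between common words."""
    if is_strict_token(t1) or is_strict_token(t2):
        return t1 == t2
    return levenshtein(t1.lower(), t2.lower()) <= max_ld


def terms_match(term1: str, term2: str, max_ld: int = 1) -> bool:
    """Compare two diagnosis terms token by token (assumed position-wise alignment)."""
    tokens1 = re.findall(r"\w+", term1)
    tokens2 = re.findall(r"\w+", term2)
    if len(tokens1) != len(tokens2):
        return False
    return all(tokens_match(a, b, max_ld) for a, b in zip(tokens1, tokens2))
```

Under these assumptions, `terms_match("Diabetes mellitus Typ 2", "Diabetes melitus Typ 2")` accepts the misspelled variant, while `terms_match("Diabetes mellitus Typ 1", "Diabetes mellitus Typ 2")` rejects the pair because the numeral tokens differ. Plain LD would treat both pairs as equally close (distance 1), which is the kind of false merge the restriction to common words is meant to avoid.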

Keywords: Electronic Health Records; Named Entity Normalization.

MeSH terms

  • Cluster Analysis
  • Electronic Health Records
  • Language*
  • Natural Language Processing*
  • Records