DE-Lemma: A Maximum-Entropy Based Lemmatizer for German Medical Text

Stud Health Technol Inform. 2023 Sep 12;307:189-195. doi: 10.3233/SHTI230712.

Abstract

When processing written German, it is helpful to use the base form (lemma) of possibly inflected words, such as verbs, nouns, or named entities. However, for German text from the (bio)medical domain, e.g., discharge letters or entries stored in electronic medical or health records (EMR, EHR), finding the correct lemma is difficult because, for instance, medical language has roots in Latin or Greek. In such cases, stemming techniques may yield inaccurate results. This study demonstrates a Machine Learning approach for training Apache OpenNLP-based lemmatizer models from publicly available German treebanks. The resulting four "DE-Lemma" models were evaluated against a sample of (bio)medical nouns randomly selected from real-world discharge letters. The most promising DE-Lemma model achieved an accuracy of 88.0% (F1 = .936).
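
The two evaluation figures can be related with a short calculation. The sketch below is illustrative only: the abstract does not state how F1 was computed, and the helper names (`accuracy`, `f1_score`) and the toy word lists are hypothetical, not taken from the study. It shows how accuracy is obtained by comparing predicted lemmas with gold lemmas, and how F1 follows from precision and recall. Under the assumption that precision is 1.0 and recall equals the reported accuracy of 0.88, the harmonic mean rounds to .936, which is numerically consistent with the reported value; other combinations of precision and recall could produce the same F1.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted lemmas that exactly match the gold lemmas."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data (not from the study): gold lemmas vs. model output.
gold = ["Niere", "Arterie", "Befund", "Tumor"]
predicted = ["Niere", "Arterie", "Befund", "Tumoren"]  # last lemma is wrong

toy_accuracy = accuracy(predicted, gold)  # 3 of 4 lemmas are correct

# Assumption: precision = 1.0 and recall = 0.88 (the reported accuracy).
assumed_f1 = f1_score(precision=1.0, recall=0.88)

print(f"toy accuracy: {toy_accuracy:.2f}")
print(f"F1 under the stated assumption: {assumed_f1:.3f}")
```

Exact string matching is the simplest scoring rule for lemmas; a real evaluation might additionally normalize case or handle multiple acceptable lemmas, which this sketch does not attempt.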

Keywords: Machine Learning; Natural Language Processing; Text Mining.

MeSH terms

  • APACHE
  • Electronic Health Records*
  • Entropy
  • Language*
  • Machine Learning