Announcement of the German Medical Text Corpus Project (GeMTeX)

Stud Health Technol Inform. 2023 May 18;302:835-836. doi: 10.3233/SHTI230283.

Abstract

The largest publicly funded project to generate a German-language medical text corpus will start in mid-2023. GeMTeX comprises clinical texts from the information systems of six university hospitals. These texts will be made accessible for NLP through the annotation of entities and relations and will be enhanced with additional meta-information. Strong governance provides a stable legal framework for the use of the corpus. State-of-the-art NLP methods are used to build, pre-annotate, and annotate the corpus and to train language models. A community will be built around GeMTeX to ensure its sustainable maintenance, use, and dissemination.

Keywords: German Medical Informatics Initiative; Natural Language Processing; Text Corpus.

MeSH terms

  • Humans
  • Language*
  • Natural Language Processing*