Annotating German Clinical Documents for De-Identification

Tobias Kolditz; Christina Lohr; Johannes Hellrich; Luise Modersohn; Boris Betz; Michael Kiehntopf; Udo Hahn

doi:10.3233/SHTI190212

Stud Health Technol Inform. 2019 Aug 21:264:203-207. doi: 10.3233/SHTI190212.

Authors

Tobias Kolditz¹, Christina Lohr¹, Johannes Hellrich¹, Luise Modersohn¹, Boris Betz², Michael Kiehntopf², Udo Hahn¹

Affiliations

¹ Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany.
² Institute of Clinical Chemistry and Laboratory Diagnostics, Jena University Hospital, Jena, Germany.

PMID: 31437914
DOI: 10.3233/SHTI190212

Abstract

We devised annotation guidelines for the de-identification of German clinical documents and assembled a corpus of 1,106 discharge summaries and transfer letters with 44K annotated protected health information (PHI) items. After three iteration rounds, our annotation team reached an inter-annotator agreement (averaged pairwise F1 score) of 0.96 at the instance level and 0.97 at the token level. To establish a baseline for automatic de-identification on our corpus, we trained a recurrent neural network (RNN) and achieved F1 scores greater than 0.9 on most major PHI categories.
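The agreement measure named in the abstract, an averaged pairwise F1 score at the instance and token level, can be illustrated with a minimal sketch. The abstract does not describe how the authors computed it, so everything below is an assumption: annotations are represented as `(start, end, label)` token spans with an exclusive end, instance-level agreement requires an exact match of span and label, and the labels `NAME` and `DATE` and the toy data are hypothetical.

```python
from itertools import combinations


def f1(a: set, b: set) -> float:
    """F1 between two annotation sets.

    Precision and recall swap when the roles of a and b are swapped,
    so the F1 score is the same in both directions.
    """
    if not a and not b:
        return 1.0  # assumption: two empty annotation sets agree fully
    return 2 * len(a & b) / (len(a) + len(b))


def to_tokens(spans):
    """Expand (start, end, label) spans into (token_index, label) pairs."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}


def averaged_pairwise_f1(annotators, token_level=False):
    """Average the F1 score over all unordered pairs of annotators."""
    sets = [to_tokens(s) if token_level else set(s) for s in annotators]
    pairs = list(combinations(sets, 2))
    return sum(f1(a, b) for a, b in pairs) / len(pairs)


# Hypothetical annotations of one document by three annotators.
ann_a = [(0, 2, "NAME"), (5, 6, "DATE")]
ann_b = [(0, 2, "NAME"), (5, 7, "DATE")]
ann_c = [(0, 1, "NAME"), (5, 6, "DATE")]

instance_f1 = averaged_pairwise_f1([ann_a, ann_b, ann_c])
token_f1 = averaged_pairwise_f1([ann_a, ann_b, ann_c], token_level=True)
print(f"instance-level: {instance_f1:.3f}, token-level: {token_f1:.3f}")
```

In this toy data the token-level score is higher than the instance-level score, because spans that differ only at a boundary still share most of their tokens while counting as full disagreements under exact instance matching.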

Keywords: Confidentiality; Data Anonymization; Natural Language Processing.

MeSH terms

Data Anonymization*
Electronic Health Records*
Natural Language Processing
Neural Networks, Computer