De-Identifying Swedish EHR Text Using Public Resources in the General Domain

Taridzo Chomutare; Kassaye Yitbarek Yigzaw; Andrius Budrionis; Alexandra Makhlysheva; Fred Godtliebsen; Hercules Dalianis

doi:10.3233/SHTI200140

Stud Health Technol Inform. 2020 Jun 16:270:148-152. doi: 10.3233/SHTI200140.

Authors

Taridzo Chomutare¹, Kassaye Yitbarek Yigzaw¹, Andrius Budrionis¹, Alexandra Makhlysheva¹, Fred Godtliebsen¹ ², Hercules Dalianis¹ ³

Affiliations

¹ Norwegian Centre for E-health Research, Tromsø, Norway.
² Faculty of Science & Technology, UiT - The Arctic University of Norway.
³ Department of Computer and Systems Sciences, Stockholm University, Sweden.

PMID: 32570364
DOI: 10.3233/SHTI200140

Abstract

Sensitive data is normally required to develop rule-based systems or to train machine learning-based models for de-identifying electronic health record (EHR) clinical notes, which poses significant problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data: (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model based on recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed that precision and recall improved from 55.62% and 80.02%, respectively, with the base EHR embedding layer to 85.01% and 87.15% when Wikipedia word vectors were added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text, which could be useful where the data is sensitive and the language is low-resource.
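
As a reading aid for the precision and recall figures above, the following is a minimal sketch of how such scores can be computed for a de-identification tagger. It assumes token-level scoring, which the abstract does not specify (the study may score at the entity level instead). The tag names (`NAME`, `DATE`, `LOCATION`), the `O` label for non-sensitive tokens, and the toy sequences are hypothetical and are not taken from the study's data.

```python
def precision_recall(gold, predicted, outside="O"):
    """Token-level precision and recall over sensitive-token tags.

    Assumption: one tag per token, with `outside` marking non-sensitive tokens.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # True positive: a sensitive tag was predicted and matches the gold tag.
    tp = sum(1 for g, p in pairs if p != outside and p == g)
    # False positive: a sensitive tag was predicted but does not match gold.
    fp = sum(1 for g, p in pairs if p != outside and p != g)
    # False negative: a gold sensitive tag was missed or mislabelled.
    fn = sum(1 for g, p in pairs if g != outside and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy example: six tokens, three of them sensitive in the gold tags.
gold = ["O", "NAME", "NAME", "O", "DATE", "O"]
predicted = ["O", "NAME", "O", "LOCATION", "DATE", "O"]
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2%} recall={r:.2%}")
```

In de-identification, recall is usually the more safety-critical of the two scores, since a missed sensitive token leaks information, whereas a false positive only removes non-sensitive text.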

Keywords: EHR; clinical text; de-identification; deep learning; wiki word vectors.

MeSH terms

Electronic Health Records*
Language
Machine Learning
Natural Language Processing
Sweden