Crosslingual named entity recognition for clinical de-identification applied to a COVID-19 Italian data set

Rosario Catelli; Francesco Gargiulo; Valentina Casola; Giuseppe De Pietro; Hamido Fujita; Massimo Esposito

doi:10.1016/j.asoc.2020.106779

Appl Soft Comput. 2020 Dec:97:106779. doi: 10.1016/j.asoc.2020.106779. Epub 2020 Oct 9.

Authors

Rosario Catelli¹ ², Francesco Gargiulo¹, Valentina Casola², Giuseppe De Pietro¹, Hamido Fujita³ ⁴ ⁵, Massimo Esposito¹

Affiliations

¹ Institute for High Performance Computing and Networking (ICAR), National Research Council, Naples, Italy.
² Department of Electrical Engineering and Information Technologies (DIETI), University of Naples Federico II, Naples, Italy.
³ Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Viet Nam.
⁴ Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, Granada, Spain.
⁵ Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan.

Abstract

The COrona VIrus Disease 19 (COVID-19) pandemic required the work of experts worldwide to tackle it. Despite the abundance of new studies, privacy laws prevent the dissemination of medical records for medical investigations: through clinical de-identification, the Protected Health Information (PHI) they contain can be anonymized so that the records can be shared and published. The automation of clinical de-identification through deep learning techniques has proven to be less effective for languages other than English due to the scarcity of data sets. To address this, a new Italian de-identification data set has been created from the COVID-19 clinical records made available by the Italian Society of Radiology (SIRM). Two multilingual deep learning systems have then been developed for this low-resource language scenario: the objective is to investigate their ability to transfer knowledge between different languages while maintaining the features necessary to correctly perform the Named Entity Recognition task for de-identification. The systems were trained with four different strategies, using both the English Informatics for Integrating Biology & the Bedside (i2b2) 2014 data set and the new Italian SIRM COVID-19 data set, and were then evaluated on the latter. These approaches have demonstrated the effectiveness of cross-lingual transfer learning for de-identifying medical records written in a low-resource language such as Italian by using a high-resource language such as English.
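To illustrate how Named Entity Recognition output can drive de-identification, the following minimal sketch replaces each tagged PHI span with a placeholder. It is not the authors' system: the BIO tagging scheme, the label names (`PATIENT`, `CITY`, `DATE`), the function names, and the example sentence are all assumptions made for illustration, and the tags are written by hand rather than predicted by a model.

```python
from typing import List, Tuple


def spans_from_bio(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, label) token spans from BIO tags; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated as an opening too.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


def deidentify(tokens: List[str], tags: List[str]) -> str:
    """Replace every tagged PHI span with a single [LABEL] placeholder."""
    out = []
    prev_end = 0
    for start, end, label in spans_from_bio(tags):
        out.extend(tokens[prev_end:start])  # copy the non-PHI tokens unchanged
        out.append(f"[{label}]")            # mask the whole PHI span
        prev_end = end
    out.extend(tokens[prev_end:])
    return " ".join(out)


# Hypothetical Italian sentence with hand-written (not model-predicted) tags.
tokens = ["Paziente", "Mario", "Rossi", "ricoverato", "a", "Napoli",
          "il", "9", "ottobre", "2020"]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "B-CITY",
        "O", "B-DATE", "I-DATE", "I-DATE"]

print(deidentify(tokens, tags))
```

In a cross-lingual setting, the tags would come from a multilingual model trained on English data, Italian data, or both; the masking step itself does not depend on the language.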

Keywords: Annotated Italian data set; COVID-19; Clinical de-identification; Deep learning; Named entity recognition.