Clinical Notes De-Identification: Scoping Recent Benchmarks for n2c2 Datasets

Taridzo Chomutare

doi:10.3233/SHTI210917

Stud Health Technol Inform. 2022 Jan 14:289:293-296. doi: 10.3233/SHTI210917.

Author

Taridzo Chomutare¹

Affiliation

¹ Norwegian Centre for E-health Research, Tromsø, Norway.

PMID: 35062150
DOI: 10.3233/SHTI210917

Abstract

Publicly shared repositories play an important role in advancing performance benchmarks for some of the most important tasks in natural language processing (NLP) and in healthcare in general. This study reviews the most recent benchmarks based on the 2014 n2c2 de-identification dataset. The review uncovered pre-processing challenges and draws attention to discrepancies among the studies in the reported number of Protected Health Information (PHI) entities. Improved reporting is required for greater transparency and reproducibility.

Keywords: NLP; Natural language processing; de-identification; i2b2.

MeSH terms

Benchmarking*
Electronic Health Records*
Natural Language Processing
Reproducibility of Results