Clinical Notes De-Identification: Scoping Recent Benchmarks for n2c2 Datasets

Stud Health Technol Inform. 2022 Jan 14;289:293-296. doi: 10.3233/SHTI210917.

Abstract

Publicly shared repositories play an important role in advancing performance benchmarks for some of the most important tasks in natural language processing (NLP) and healthcare in general. This study reviews the most recent benchmarks based on the 2014 n2c2 de-identification dataset. Pre-processing challenges were uncovered, and attention is drawn to discrepancies in the reported number of Protected Health Information (PHI) entities among the studies. Improved reporting is required for greater transparency and reproducibility.

Keywords: NLP; Natural language processing; de-identification; i2b2.

MeSH terms

  • Benchmarking*
  • Electronic Health Records*
  • Natural Language Processing
  • Reproducibility of Results