DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes

J Med Syst. 2022 Nov 16;46(12):96. doi: 10.1007/s10916-022-01880-6.

Abstract

Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free-text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data, which inherits the joint population distribution of the original data, and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, using privacy protection, 60.9-86.5% of the synthetic descriptions belong to the same cluster as the original descriptions, demonstrating better utility preservation than the naïve content suppressing method (45.8-85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).

Keywords: AI; Clinical notes; Data science; ML; PHI; Synthetic data.

MeSH terms

  • Electronic Health Records*
  • Humans
  • Privacy*