Measuring re-identification risk using a synthetic estimator to enable data sharing

Yangdi Jiang; Lucy Mosquera; Bei Jiang; Linglong Kong; Khaled El Emam

doi:10.1371/journal.pone.0269097

PLoS One. 2022 Jun 17;17(6):e0269097. doi: 10.1371/journal.pone.0269097. eCollection 2022.

Authors

Yangdi Jiang¹ ², Lucy Mosquera², Bei Jiang¹, Linglong Kong¹, Khaled El Emam² ³ ⁴

Affiliations

¹ Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada.
² Replica Analytics Ltd., Ottawa, Ontario, Canada.
³ School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada.
⁴ Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada.

Abstract

Background: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.

Objectives: Develop an accurate risk estimator for the sample-to-population attack.

Methods: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
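
As a rough illustration of the quantity being estimated, the sketch below computes an average match probability for a sample-to-population attack once a population dataset (in the paper's setting, a synthetic one) is available. It assumes the risk for a sample record is 1/F, where F is the number of population records sharing that record's quasi-identifier values; that formula, the function name `sample_to_population_risk`, and the toy records are illustrative assumptions rather than the authors' implementation, and the copula-based synthesis step is not shown.

```python
from collections import Counter


def sample_to_population_risk(sample, population, quasi_identifiers):
    """Mean probability of a correct match when an adversary picks a
    sample record and matches it against the population.

    Assumption (not taken from the abstract): the per-record risk is
    1 / F, with F the size of the record's population equivalence class.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    # Size of each equivalence class in the (synthetic) population.
    class_sizes = Counter(key(r) for r in population)

    risks = []
    for record in sample:
        size = class_sizes.get(key(record), 0)
        # If the synthetic population holds no matching record, treat the
        # sample record as population-unique (size 1), the worst case.
        risks.append(1.0 / max(size, 1))
    return sum(risks) / len(risks)


# Hypothetical toy data: two quasi-identifiers, six population records.
population = (
    [{"age": 30, "sex": "F"}] * 3
    + [{"age": 30, "sex": "M"}] * 2
    + [{"age": 40, "sex": "F"}]
)
sample = [{"age": 30, "sex": "F"}, {"age": 40, "sex": "F"}]

risk = sample_to_population_risk(sample, population, ["age", "sex"])
print(round(risk, 3))  # (1/3 + 1/1) / 2 -> 0.667
```

In this toy case the first sample record shares its values with three population records and the second is unique, so the average risk is (1/3 + 1)/2.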

Results: The average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values, which was significantly more accurate than existing methods. A sensitivity analysis of how estimator accuracy varies with the accuracy of the input parameters provides further guidance for applying the estimator. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.

Conclusions: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

COVID-19* / epidemiology
Humans
Information Dissemination
Privacy
Probability
Risk

Grants and funding

This work was partially funded by: (1) a Discovery Grant, RGPIN-2016-06781, from the Natural Sciences and Engineering Research Council of Canada [KEE] (https://www.nserc-crsng.gc.ca/index_eng.asp); (2) a MITACS Accelerate grant [YJ] (https://www.mitacs.ca/en); and (3) Replica Analytics [LM] (https://replica-analytics.com/home). NSERC and MITACS had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. LM is an employee of Replica Analytics and participated in the design of the study; she also provided expertise on data synthesis methods for the execution of the project.