Membership inference attacks against synthetic health data

Ziqi Zhang; Chao Yan; Bradley A Malin

doi:10.1016/j.jbi.2021.103977

J Biomed Inform. 2022 Jan:125:103977. doi: 10.1016/j.jbi.2021.103977. Epub 2021 Dec 14.

Authors

Ziqi Zhang¹, Chao Yan², Bradley A Malin³

Affiliations

¹ Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240, United States. Electronic address: ziqi.zhang@vanderbilt.edu.
² Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240, United States.
³ Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240, United States; Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN 37240, United States.

Abstract

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
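As a rough illustration of the membership inference setting described in the abstract, the following sketch runs a simple distance-based attack on toy data. It is not the contrastive representation learning framework the authors introduce. It swaps in a much simpler nearest-neighbor baseline, and every name in it (`hamming`, `min_distance`, `infer_membership`, `make_records`, `partially_synthesize`), along with the random binary records and the threshold, is a hypothetical choice made for illustration only. The toy "partial synthesis" resamples a few features per record and keeps the rest, which is an assumed simplification of what partially synthetic data might look like.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(target, synthetic):
    """Distance from a target record to its closest synthetic record."""
    return min(hamming(target, s) for s in synthetic)


def infer_membership(targets, synthetic, threshold):
    """Flag a target as a training member if some synthetic record
    lies within `threshold` mismatches of it (assumed decision rule)."""
    return [min_distance(t, synthetic) <= threshold for t in targets]


def make_records(n, n_features, rng):
    """Toy stand-in for real records: random binary feature vectors."""
    return [tuple(rng.randint(0, 1) for _ in range(n_features)) for _ in range(n)]


def partially_synthesize(records, n_replaced, rng):
    """Toy partial synthesis: resample the first `n_replaced` features
    of each record and keep the remaining features unchanged."""
    out = []
    for r in records:
        new_head = tuple(rng.randint(0, 1) for _ in range(n_replaced))
        out.append(new_head + r[n_replaced:])
    return out


rng = random.Random(0)
members = make_records(50, 40, rng)        # records used for synthesis
non_members = make_records(50, 40, rng)    # records never seen by the generator
synthetic = partially_synthesize(members, n_replaced=5, rng=rng)

# At most 5 features were changed per record, so a threshold of 5
# is guaranteed to flag every member in this toy setup.
threshold = 5
member_flags = infer_membership(members, synthetic, threshold)
non_member_flags = infer_membership(non_members, synthetic, threshold)

print("members flagged:    ", sum(member_flags), "of", len(member_flags))
print("non-members flagged:", sum(non_member_flags), "of", len(non_member_flags))
```

In this toy setup each member keeps most of its original features in some synthetic record, so a distance threshold separates members from unrelated records. A fully synthetic generator that samples new records from a learned distribution would not preserve such a one-to-one correspondence, which is one intuition for why a distance-based rule would be expected to perform worse against it.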

Keywords: Contrastive representation learning; Electronic health record; Membership inference; Synthetic data.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Confidentiality
Disclosure
Genomics
Humans
Machine Learning
Population Health*
