Managing re-identification risks while providing access to the All of Us research program

Weiyi Xia; Melissa Basford; Robert Carroll; Ellen Wright Clayton; Paul Harris; Murat Kantacioglu; Yongtai Liu; Steve Nyemba; Yevgeniy Vorobeychik; Zhiyu Wan; Bradley A Malin

doi:10.1093/jamia/ocad021

J Am Med Inform Assoc. 2023 Apr 19;30(5):907-914. doi: 10.1093/jamia/ocad021.

Authors

Weiyi Xia¹, Melissa Basford², Robert Carroll¹, Ellen Wright Clayton³ ⁴ ⁵, Paul Harris¹ ⁶, Murat Kantacioglu⁷, Yongtai Liu⁸, Steve Nyemba¹, Yevgeniy Vorobeychik⁹, Zhiyu Wan¹, Bradley A Malin¹ ⁸ ¹⁰

Affiliations

¹ Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
² Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
³ Law School, Vanderbilt University, Nashville, Tennessee, USA.
⁴ Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
⁵ Department of Health Policy, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
⁶ Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee, USA.
⁷ Department of Computer Science, University of Texas at Dallas, Dallas, Texas, USA.
⁸ Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA.
⁹ Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri, USA.
¹⁰ Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

Abstract

Objective: The All of Us Research Program makes individual-level data available to researchers while protecting the participants' privacy. This article describes the protections embedded in the multistep access process, with a particular focus on how the data was transformed to meet generally accepted re-identification risk levels.

Methods: At the time of the study, the resource consisted of 329 084 participants. Systematic amendments were applied to the data to mitigate re-identification risk (eg, generalization of geographic regions, suppression of public events, and randomization of dates). We computed the re-identification risk for each participant using a state-of-the-art adversarial model that assumes the adversary already knows the targeted individual is a participant in the program. We confirmed that the expected risk is no greater than 0.09, a threshold that is consistent with guidelines from various US state and federal agencies. We further investigated how risk varied as a function of participant demographics.
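To make the risk calculation concrete, the sketch below shows one simple way to score re-identification risk after generalization. It is not the adversarial model used in the study, which the abstract does not specify. It assumes a basic group-size measure: each record's risk is 1 divided by the number of records sharing its generalized quasi-identifiers, and the expected risk is taken as the mean over records. The field names (`zip`, `age`, `gender`), the generalization choices, and the toy records are all hypothetical; only the 0.09 threshold comes from the text.

```python
from collections import Counter

THRESHOLD = 0.09  # expected-risk threshold cited in the abstract


def generalize(record):
    """Coarsen quasi-identifiers (illustrative choices, not the program's rules)."""
    zip3 = record["zip"][:3]               # keep only the 3-digit ZIP prefix
    age_band = (record["age"] // 10) * 10  # 10-year age band
    return (zip3, age_band, record["gender"])


def per_record_risk(records):
    """Risk of each record = 1 / size of its group of identical generalized keys."""
    keys = [generalize(r) for r in records]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]


# Hypothetical toy data; a real cohort would be far larger.
records = [
    {"zip": "37203", "age": 34, "gender": "F"},
    {"zip": "37212", "age": 38, "gender": "F"},
    {"zip": "37240", "age": 31, "gender": "F"},
    {"zip": "37205", "age": 52, "gender": "M"},
    {"zip": "37215", "age": 57, "gender": "M"},
]

risks = per_record_risk(records)
expected_risk = sum(risks) / len(risks)
meets_threshold = expected_risk <= THRESHOLD
print(f"expected risk = {expected_risk:.3f}, meets threshold: {meets_threshold}")
```

With only five records the groups are tiny, so this toy cohort fails the threshold. Under this measure, groups need to be much larger (about 12 or more records per group on average) before the mean risk drops below 0.09.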

Results: The results indicated that the 95th percentile of the re-identification risk across all participants is below current thresholds. At the same time, we observed that risk levels were higher for certain race, ethnicity, and gender groups.
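As an illustration of how a percentile summary of this kind might be produced, the sketch below computes the 95th percentile of per-participant risk, overall and per demographic group, using the nearest-rank definition. The group labels and risk values are invented for the example, and the study's actual percentile method is not stated in the abstract.

```python
import math
from collections import defaultdict


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_by_group(rows, pct=95):
    """rows is an iterable of (group_label, risk) pairs."""
    grouped = defaultdict(list)
    for group, risk in rows:
        grouped[group].append(risk)
    return {g: percentile_nearest_rank(v, pct) for g, v in grouped.items()}


# Hypothetical (group, risk) pairs, not data from the study.
rows = [
    ("A", 0.01), ("A", 0.02), ("A", 0.05), ("A", 0.03),
    ("B", 0.04), ("B", 0.20), ("B", 0.08), ("B", 0.06),
]

by_group = percentile_by_group(rows)
overall = percentile_nearest_rank([risk for _, risk in rows], 95)
print(by_group, overall)
```

Comparing each group's percentile against the overall figure is one simple way to surface the kind of demographic variation the results describe.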

Conclusions: While the re-identification risk was sufficiently low, this does not imply that the system is devoid of risk. Rather, All of Us uses a multipronged data protection strategy that includes strong authentication practices, active monitoring of data misuse, and penalization mechanisms for users who violate terms of service.

Keywords: All of Us Research Program; data privacy; data sharing; electronic health records.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Computer Security
Female
Humans
Male
Population Health*
Privacy
Research Personnel
Risk Management
