Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets

Samer El Kababji; Nicholas Mitsakakis; Xi Fang; Ana-Alicia Beltran-Bless; Greg Pond; Lisa Vandermeer; Dhenuka Radhakrishnan; Lucy Mosquera; Alexander Paterson; Lois Shepherd; Bingshu Chen; William E Barlow; Julie Gralow; Marie-France Savard; Mark Clemons; Khaled El Emam

doi:10.1200/CCI.23.00116

JCO Clin Cancer Inform. 2023 Sep:7:e2300116. doi: 10.1200/CCI.23.00116.

Authors

Samer El Kababji¹, Nicholas Mitsakakis¹, Xi Fang², Ana-Alicia Beltran-Bless³,⁴, Greg Pond⁵, Lisa Vandermeer³, Dhenuka Radhakrishnan¹,⁶, Lucy Mosquera¹,², Alexander Paterson⁷, Lois Shepherd⁸, Bingshu Chen⁸, William E Barlow⁹, Julie Gralow¹⁰, Marie-France Savard³,⁴, Mark Clemons³,⁴, Khaled El Emam¹,²,¹¹

Affiliations

¹ CHEO Research Institute, Ottawa, ON, Canada.
² Replica Analytics Ltd, Ottawa, ON, Canada.
³ Ottawa Hospital Research Institute, Ottawa, ON, Canada.
⁴ Division of Medical Oncology, Department of Medicine, University of Ottawa, ON, Canada.
⁵ McMaster University, Hamilton, ON, Canada.
⁶ Department of Paediatrics, University of Ottawa, Ottawa, ON, Canada.
⁷ Alberta Health Services, Edmonton, AB, Canada.
⁸ Queen's University, Kingston, ON, Canada.
⁹ Cancer Research and Biostatistics, Seattle, WA.
¹⁰ University of Washington, Seattle, WA.
¹¹ School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.

PMID: 38011617
PMCID: PMC10703127 (available on 2024-11-27)
DOI: 10.1200/CCI.23.00116

Abstract

Purpose: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is little evidence to support this claim for oncology clinical trial data sets, or to establish the utility of privacy-preserving synthetic data. The objective of this study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques.

Methods: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk.
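
As an illustration of the CI concordance idea, the sketch below computes an interval-overlap score between a confidence interval estimated on real data and one estimated on synthetic data. The abstract does not state which overlap formula the study used, so the definition here (the shared length averaged over each interval's own length) is an assumption, and the function name ci_overlap is hypothetical.

```python
def ci_overlap(real_ci, synth_ci):
    """Interval-overlap score between two confidence intervals.

    Assumed definition (not taken from the abstract): the length of the
    shared region, expressed as a fraction of each interval's length,
    averaged over the two intervals. Identical intervals score 1.0;
    disjoint intervals score below 0.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    if hi_r <= lo_r or hi_s <= lo_s:
        raise ValueError("each interval must have positive length")

    # Signed length of the shared region (negative when the intervals are disjoint).
    shared = min(hi_r, hi_s) - max(lo_r, lo_s)

    # Average the shared fraction relative to each interval's own length.
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


if __name__ == "__main__":
    # Made-up intervals for illustration only; not results from the study.
    print(ci_overlap((0.0, 1.0), (0.5, 1.5)))
```

Under this definition a score near 1 would indicate that the synthetic-data interval closely tracks the real-data interval, while a negative score would flag intervals that do not intersect at all.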

Results: Utility was highest using the sequential synthesis method, where all results were replicable and the CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models.

Discussion: Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.

MeSH terms

Breast Neoplasms* / diagnosis
Breast Neoplasms* / therapy
Female
Humans
Medical Oncology
Privacy*
Research Personnel