A Scalable Privacy-preserving Data Generation Methodology for Exploratory Analysis

AMIA Annu Symp Proc. 2018 Apr 16;2017:1695-1704. eCollection 2017.

Abstract

Big data coupled with precision medicine has the potential to significantly improve our understanding and treatment of complex disorders such as cancer, diabetes, and depression. However, the essential problem is that data are stuck in silos, and it is difficult to identify precisely which data would be relevant and useful for any particular type of analysis. Acquiring and accessing biomedical data requires significant effort, yet in many cases the data may not provide much insight into the problem at hand. Therefore, there is a need to measure the utility and relevance of additional datasets for a particular biomedical research task without direct access to the data. To this end, we develop a privacy-preserving approach to create synthetic data that can provide a first-order approximation of utility. We evaluate the proposed approach on several biomedical datasets in the context of regression and classification tasks, and discuss how it can be incorporated into existing data management systems such as REDCap.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Big Data
  • Biomedical Research*
  • Computer Security*
  • Datasets as Topic*
  • Humans
  • Privacy*