Privacy preserving Generative Adversarial Networks to model Electronic Health Records

Rohit Venugopal; Noman Shafqat; Ishwar Venugopal; Benjamin Mark John Tillbury; Harry Demetrios Stafford; Aikaterini Bourazeri

doi:10.1016/j.neunet.2022.06.022

Neural Netw. 2022 Sep:153:339-348. doi: 10.1016/j.neunet.2022.06.022. Epub 2022 Jun 25.

Authors

Rohit Venugopal¹, Noman Shafqat¹, Ishwar Venugopal¹, Benjamin Mark John Tillbury¹, Harry Demetrios Stafford¹, Aikaterini Bourazeri²

Affiliations

¹ School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, United Kingdom.
² School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, United Kingdom. Electronic address: a.bourazeri@essex.ac.uk.

PMID: 35779443

Abstract

Hospitals and General Practitioner (GP) surgeries within the National Health Service (NHS) routinely collect patient information to create personal health records covering, for example, family medical history, chronic diseases, medications and dosing. This information could be used to build and train machine learning models that simplify the work of those within the NHS. However, such Electronic Health Records are not made publicly available because of privacy concerns. In this paper, we propose a privacy-preserving Generative Adversarial Network (pGAN) that generates high-quality synthetic data while preserving the privacy and statistical properties of the source data. pGAN is evaluated on two distinct datasets, one posed as a classification task and the other as a regression task. The privacy score of the generated data is calculated using the Nearest Neighbour Adversarial Accuracy. Cosine similarity scores indicate that the synthetic data generated by the proposed model are similar in nature to the original data, but not identical. The proposed model also preserved privacy while maintaining high utility: machine learning models trained on synthetic and original data achieved accuracies of 74.3% and 74.5%, respectively, on the classification dataset, and R² scores of 0.84 and 0.85, respectively, on the regression dataset. These results indicate that synthetic data from the proposed model could replace original data for machine learning while preserving privacy.
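The abstract names two measures, Nearest Neighbour Adversarial Accuracy and cosine similarity, without giving formulas. The sketch below is an illustration only and is not taken from the paper: it assumes the commonly used definition of adversarial accuracy (the average of two leave-one-out nearest-neighbour tests with Euclidean distance) and the standard cosine similarity between two numeric records. The function names, the choice of distance, and the toy records are all hypothetical. How the paper turns adversarial accuracy into its privacy score is not stated in the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_in_other_set(row, others):
    """Distance from `row` to its nearest neighbour in a different set."""
    return min(euclidean(row, other) for other in others)


def nearest_in_same_set(index, rows):
    """Leave-one-out distance from rows[index] to its nearest neighbour
    among the remaining rows of the same set (needs at least 2 rows)."""
    return min(
        euclidean(rows[index], row) for j, row in enumerate(rows) if j != index
    )


def adversarial_accuracy(real, synth):
    """Nearest-neighbour adversarial accuracy (assumed definition).

    For each record, check whether its nearest neighbour in the other set
    is farther away than its nearest neighbour in its own set, then average
    the two resulting fractions. Roughly: values near 0.5 suggest the sets
    are hard to tell apart, values near 0 suggest synthetic records sit on
    top of real ones, values near 1 suggest the sets are easy to separate.
    """
    real_term = sum(
        nearest_in_other_set(row, synth) > nearest_in_same_set(i, real)
        for i, row in enumerate(real)
    ) / len(real)
    synth_term = sum(
        nearest_in_other_set(row, real) > nearest_in_same_set(i, synth)
        for i, row in enumerate(synth)
    ) / len(synth)
    return 0.5 * (real_term + synth_term)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero numeric records."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy records, not data from the paper.
    real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    copied = [list(row) for row in real]            # exact copies of real data
    distant = [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
    print(adversarial_accuracy(real, copied))
    print(adversarial_accuracy(real, distant))
    print(cosine_similarity(real[1], real[2]))
```

In this sketch, a synthetic set that merely copies the real records scores at the low extreme and a set far from the real data scores at the high extreme, which matches the abstract's goal of data that is "similar in nature, but not identical". For realistic record counts, a vectorised or tree-based nearest-neighbour search would be preferable to these quadratic loops.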

Keywords: AI; GAN; Machine learning; Privacy; Public health data.

MeSH terms

Algorithms
Data Collection
Electronic Health Records*
Humans
Machine Learning
Privacy*