Assessing transcriptomic reidentification risks using discriminative sequence models

Shuvom Sadhuka; Daniel Fridman; Bonnie Berger; Hyunghoon Cho

doi:10.1101/gr.277699.123

Genome Res. 2023 Jul;33(7):1101-1112. doi: 10.1101/gr.277699.123. Epub 2023 Aug 4.

Authors

Shuvom Sadhuka¹ ², Daniel Fridman² ³, Bonnie Berger¹ ², Hyunghoon Cho⁴

Affiliations

¹ Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
² Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.
³ Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA.
⁴ Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA; bab@mit.edu hhcho@broadinstitute.org.

Abstract

Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works demonstrating this risk could analyze only the fraction of eQTLs that are independent, owing to restrictive model assumptions, leaving the full extent of the risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.
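
To make the notion of a linking attack concrete, the following is a minimal sketch and not the DSM itself. It assumes every eQTL is independent (the restrictive assumption the abstract attributes to prior work) and that expression at each eQTL gene is Gaussian with a genotype-dependent mean. The function names, parameter layout, and toy numbers are all hypothetical and chosen only for illustration.

```python
import math

def genotype_log_likelihood(expr, genotype, means, sd):
    """Log-density of one expression value under a Gaussian whose
    mean depends on the genotype (0, 1, or 2 alternate alleles)."""
    mu = means[genotype]
    return -0.5 * ((expr - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def link(expression_profiles, genotype_profiles, eqtl_params):
    """For each expression profile, return the index of the genotype
    profile with the highest summed log-likelihood across eQTLs.
    Summing per-eQTL terms is what encodes the independence assumption."""
    matches = []
    for expr in expression_profiles:
        scores = [
            sum(
                genotype_log_likelihood(e, g, means, sd)
                for e, g, (means, sd) in zip(expr, geno, eqtl_params)
            )
            for geno in genotype_profiles
        ]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches

# Hypothetical toy data: three eQTLs, three individuals.
# Each eQTL: (mean expression for genotype 0, 1, 2), shared standard deviation.
eqtl_params = [
    ((0.0, 1.0, 2.0), 0.5),
    ((2.0, 1.0, 0.0), 0.5),
    ((0.0, 0.5, 1.0), 0.5),
]
genotype_profiles = [(0, 2, 1), (2, 0, 0), (1, 1, 2)]
expression_profiles = [(1.9, 1.8, 0.1), (0.1, 0.2, 0.4), (1.1, 0.9, 1.0)]

matches = link(expression_profiles, genotype_profiles, eqtl_params)
print(matches)  # [1, 0, 2]
```

In this toy setup each expression profile is linked to the genotype profile that best explains it. Because the per-eQTL scores are simply summed, correlated eQTLs (for example, variants in linkage disequilibrium) would be counted as if they were separate evidence, which is the kind of miscalibration the abstract says DSM is designed to address by modeling eQTLs in a region jointly.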

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural

MeSH terms

Gene Expression Profiling
Genome-Wide Association Study*
Genotype
Humans
Polymorphism, Single Nucleotide
Quantitative Trait Loci
Transcriptome*
