Population scale latent space cohort matching for the improved use and exploration of observational trial data

Rachel Gologorsky; Sulaiman S Somani; Sean N Neifert; Aly A Valliani; Katherine E Link; Viola J Chen; Anthony B Costa; Eric K Oermann

doi:10.3934/mbe.2022320

Math Biosci Eng. 2022 May 5;19(7):6795-6813. doi: 10.3934/mbe.2022320.

Authors

Rachel Gologorsky¹, Sulaiman S Somani², Sean N Neifert³, Aly A Valliani¹, Katherine E Link¹, Viola J Chen⁴, Anthony B Costa⁵, Eric K Oermann³,⁶

Affiliations

¹ Department of Medicine, Icahn School of Medicine, New York, NY 10028, USA.
² Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA.
³ Department of Neurosurgery, NYU Grossman School of Medicine, New York, NY 10016, USA.
⁴ Oncology Early Development, Merck & Co., Inc., Kenilworth, NJ 07033, USA.
⁵ NVIDIA, Santa Clara, CA 95051, USA.
⁶ Department of Radiology, NYU Grossman School of Medicine, New York, NY 10016, USA.

PMID: 35730283
DOI: 10.3934/mbe.2022320

Abstract

A significant amount of clinical research is observational by nature and derived from medical records, clinical trials, and large-scale registries. While there is no substitute for randomized, controlled experimentation, such experiments or trials are often costly, time-consuming, and even ethically or practically impossible to execute. Combining classical regression and structural equation modeling with matching techniques can leverage the value of observational data. Nevertheless, identifying variables of greatest interest in high-dimensional data is frequently challenging, even with application of classical dimensionality reduction and/or propensity scoring techniques. Here, we demonstrate that projecting high-dimensional medical data onto a lower-dimensional manifold using deep autoencoders, and then generating treatment/control cohorts post hoc based on proximity in that lower-dimensional space, results in better matching of confounding variables than classical propensity score matching (PSM) in the original high-dimensional space (P < 0.0001). When evaluated on predicting risk ratios from real-world clinical data, the approach performs similarly to PSM models constructed by experts with prior knowledge of the underlying pathology. Thus, when the underlying problem is poorly understood and the data are high-dimensional, matching in the autoencoder latent space might be of particular benefit.

Keywords: artificial intelligence; autoencoders; cohort matching; data visualization; deep learning; manifold learning.
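
The matching idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It rests on several assumptions that do not come from the source: synthetic data, a single-hidden-layer tanh autoencoder written in NumPy as a stand-in for a deep autoencoder, a 2-dimensional latent space, and greedy 1:1 nearest-neighbour matching without replacement using Euclidean distance. The names `train_autoencoder` and `match_in_latent_space` are hypothetical.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, epochs=500, lr=0.1, seed=0):
    """Fit a tiny tanh autoencoder by full-batch gradient descent.

    Assumption: one hidden (latent) layer only, used here as a small
    stand-in for the deep autoencoders named in the abstract.
    Returns an encoder function mapping rows of X to latent codes.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, latent_dim))
    b1 = np.zeros(latent_dim)
    W2 = rng.normal(0.0, 0.1, (latent_dim, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)           # encode
        X_hat = Z @ W2 + b2                # decode (linear output)
        err = (X_hat - X) / n              # gradient of 0.5 * mean squared error
        gW2, gb2 = Z.T @ err, err.sum(axis=0)
        dZ = (err @ W2.T) * (1.0 - Z**2)   # backpropagate through tanh
        gW1, gb1 = X.T @ dZ, dZ.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return lambda A: np.tanh(A @ W1 + b1)

def match_in_latent_space(Z, treated):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Each treated row is paired with the closest still-unused control row
    (Euclidean distance in latent space). Returns (treated, control) index pairs.
    """
    treated = np.asarray(treated, dtype=bool)
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    available = np.ones(len(c_idx), dtype=bool)
    pairs = []
    for t in t_idx:
        if not available.any():
            break                          # no controls left to match
        dist = np.linalg.norm(Z[c_idx] - Z[t], axis=1)
        dist[~available] = np.inf          # exclude already-used controls
        j = int(np.argmin(dist))
        available[j] = False
        pairs.append((int(t), int(c_idx[j])))
    return pairs

# Synthetic example (assumption): treatment assignment depends on covariate 0,
# so the raw treated and control groups are imbalanced on that covariate.
rng = np.random.default_rng(42)
n, d = 400, 10
X = rng.normal(size=(n, d))
p_treat = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0)))
treated = rng.random(n) < p_treat

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before encoding
encode = train_autoencoder(X_std)
Z = encode(X_std)
pairs = match_in_latent_space(Z, treated)

matched_t = np.array([t for t, _ in pairs])
matched_c = np.array([c for _, c in pairs])
# Compare the covariate-0 gap between groups before and after matching.
gap_before = abs(X[treated, 0].mean() - X[~treated, 0].mean())
gap_after = abs(X[matched_t, 0].mean() - X[matched_c, 0].mean())
print(f"covariate 0 mean gap: before={gap_before:.3f}, after={gap_after:.3f}")
```

Greedy matching depends on the order in which treated rows are visited. An optimal assignment or caliper-based matching are common alternatives, and the abstract does not say which matching rule the authors used.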

Publication types

Observational Study

MeSH terms

Cohort Studies
Humans
Propensity Score
Research Design*