How Confident Are We about Observational Findings in Healthcare: A Benchmark Study

Harv Data Sci Rev. 2020;2(1):10.1162/99608f92.147cc28e. doi: 10.1162/99608f92.147cc28e. Epub 2020 Jan 31.

Abstract

Healthcare professionals increasingly rely on observational healthcare data, such as administrative claims and electronic health records, to estimate the causal effects of interventions. However, the limited prior research that exists raises concerns about the real-world performance of the statistical and epidemiological methods in use. We present the "OHDSI Methods Benchmark," which aims to evaluate the performance of effect estimation methods on real data. The benchmark comprises a gold standard, a set of metrics, and a set of open-source software tools. The gold standard is a collection of real negative controls (drug-outcome pairs where no causal effect appears to exist) and synthetic positive controls (drug-outcome pairs that augment negative controls with simulated causal effects). We apply the benchmark using four large healthcare databases to evaluate methods commonly used in practice: the new-user cohort, self-controlled cohort, case-control, case-crossover, and self-controlled case series designs. The results confirm these concerns, showing that the operating characteristics of most methods deviate considerably from nominal levels; for example, in most contexts, only half of the 95% confidence intervals we calculated contain the corresponding true effect size. We also evaluate our previously developed "empirical calibration" procedure, which aims to restore these operating characteristics. While no single method dominates, self-controlled methods such as the empirically calibrated self-controlled case series perform well across a wide range of scenarios.
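
The calibration procedure referenced above is implemented in OHDSI's open-source tools (in R); purely as an illustration, the sketch below shows in Python two ideas the abstract relies on: measuring how often nominal 95% confidence intervals cover the known true effect of the controls, and empirically calibrating intervals using a systematic-error distribution fitted to negative controls. The function names, the simulated data, and the simplified calibration model (a normal systematic-error distribution assumed constant across true effect sizes) are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def coverage(log_estimates, std_errors, true_log_effects, alpha=0.05):
    """Fraction of nominal (1 - alpha) confidence intervals containing the true effect."""
    z = stats.norm.ppf(1 - alpha / 2)
    lower = log_estimates - z * std_errors
    upper = log_estimates + z * std_errors
    return np.mean((lower <= true_log_effects) & (true_log_effects <= upper))

def fit_systematic_error(nc_estimates, nc_std_errors):
    """Fit a normal systematic-error distribution N(mu, sigma^2) to negative-control
    estimates (true log effect = 0) by maximum likelihood, modeling each observed
    estimate as N(mu, sigma^2 + tau_i^2)."""
    def neg_log_lik(params):
        mu, log_sigma = params
        var = np.exp(log_sigma) ** 2 + nc_std_errors ** 2
        return -np.sum(stats.norm.logpdf(nc_estimates, loc=mu, scale=np.sqrt(var)))
    res = minimize(neg_log_lik, x0=np.array([0.0, np.log(0.1)]))
    return res.x[0], np.exp(res.x[1])  # (mu, sigma)

def calibrated_ci(log_estimate, std_error, mu, sigma, alpha=0.05):
    """Shift and widen a confidence interval to account for fitted systematic error."""
    z = stats.norm.ppf(1 - alpha / 2)
    total_se = np.sqrt(sigma ** 2 + std_error ** 2)
    return log_estimate - mu - z * total_se, log_estimate - mu + z * total_se

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    tau = rng.uniform(0.05, 0.3, n)    # per-estimate standard errors
    bias = rng.normal(0.2, 0.2, n)     # injected systematic error (e.g., confounding)
    theta = rng.normal(bias, tau)      # observed log effects; true effect is 0
    print("nominal 95% CI coverage:", coverage(theta, tau, 0.0))
    mu, sigma = fit_systematic_error(theta, tau)
    lo, hi = np.array([calibrated_ci(t, s, mu, sigma) for t, s in zip(theta, tau)]).T
    print("calibrated 95% CI coverage:", np.mean((lo <= 0.0) & (0.0 <= hi)))
```

In this simulated run, uncalibrated coverage falls well below the nominal 95% because of the injected systematic error, while the calibrated intervals approximately restore nominal coverage, mirroring the qualitative pattern the abstract describes on real data.
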

Keywords: causal effect estimation; evaluation; methods; observational research.