All that Glitters Is not Gold: Type-I Error Controlled Variable Selection from Clinical Trial Data

Clin Pharmacol Ther. 2024 Apr;115(4):774-785. doi: 10.1002/cpt.3211. Epub 2024 Feb 28.

Abstract

Clinical trials are primarily conducted to estimate causal effects, but the data collected can also be invaluable for additional research, such as identifying prognostic measures of disease or biomarkers that predict treatment efficacy. However, these exploratory settings are prone to false discoveries (type-I errors) due to the multiple comparisons they entail. Unfortunately, many methods fail to address this issue, in part because the algorithms used are generally designed to optimize predictions and often provide the measures used for variable selection, such as machine learning model importance scores, only as a byproduct. To address the resulting unquantified uncertainty in the selection sets, the knockoff framework offers a model-agnostic, robust approach to variable selection with guaranteed type-I error control. Here, we review the knockoff framework in the setting of clinical data, highlighting the main considerations using simulation studies. We also extend the framework by introducing a novel knockoff generation method that addresses two main limitations of previously suggested methods relevant for clinical development settings. With this new method, we empirically obtain tighter bounds on type-I error control and gain an order of magnitude in computational efficiency in mixed data settings. We demonstrate comparable selections to those of the competing method for identifying prognostic biomarkers for C-reactive protein levels in patients with psoriatic arthritis in four clinical trials. Our work increases access to the knockoff framework for variable selection from clinical trial data. In doing so, this paper helps to address the current replicability crisis, which can result in unnecessary research efforts, increased patient burden, and avoidable costs.
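
To illustrate how a knockoff-based selection step can control false discoveries, the following is a minimal sketch of the generic "knockoff+" thresholding rule, assuming the false discovery rate as the type-I error measure. It is not the knockoff generation method introduced in this paper. The function name `knockoff_plus_select`, the input `W` (one statistic per variable, where a large positive value means the original variable looks more important than its knockoff copy), and the target level `q` are illustrative assumptions.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Return indices of variables selected by the knockoff+ threshold.

    W : per-variable knockoff statistics (hypothetical input); positive
        values favour the original variable over its knockoff copy.
    q : target false discovery rate (assumed error measure).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes, smallest first.
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics act as a proxy count of false positives;
        # the "+1" makes the estimate conservative (the "plus" variant).
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int)


if __name__ == "__main__":
    W = [5, 4, 3, 2.5, 2, -0.5, 1.5, 1, 0.8, 0.3]
    print(knockoff_plus_select(W, q=0.2))
```

In this sketch, computing `W` (for example from model importance scores of original versus knockoff variables) is left to the user. Only the thresholding step is shown, since it is the part that turns arbitrary importance measures into a selection with an error guarantee.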

Publication types

  • Review

MeSH terms

  • Algorithms*
  • Biomarkers
  • Computer Simulation
  • Humans
  • Machine Learning*
  • Uncertainty

Substances

  • Biomarkers