Variable Selection in Bayesian Multiple Instance Regression using Shotgun Stochastic Search

Seongoh Park; Joungyoun Kim; Xinlei Wang; Johan Lim

doi:10.1016/j.csda.2024.107954

Comput Stat Data Anal. 2024 Aug:196:107954. doi: 10.1016/j.csda.2024.107954. Epub 2024 Mar 24.

Authors

Seongoh Park¹ ², Joungyoun Kim³, Xinlei Wang⁴ ⁵, Johan Lim⁶

Affiliations

¹ School of Mathematics, Statistics and Data Science, Sungshin Women's University, Seoul, Korea.
² Data Science Center, Sungshin Women's University, Seoul, Korea.
³ Department of Artificial Intelligence, University of Seoul, Seoul, Korea.
⁴ Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, TX, USA.
⁵ Department of Mathematics, University of Texas at Arlington, Arlington, TX, USA.
⁶ Department of Statistics, Seoul National University, Seoul, 08826, Korea.

PMID: 38646418
PMCID: PMC11027161 (available on 2025-08-01)
DOI: 10.1016/j.csda.2024.107954

Abstract

In multiple instance learning (MIL), a bag represents a sample that consists of a set of instances, each described by a vector of explanatory variables, while the bag as a whole has only one label or response. Although many MIL methods have been developed, few have paid attention to the interpretability of models and results. The proposed Bayesian regression model has a two-level hierarchy that transparently shows how explanatory variables explain, and instances contribute to, bag responses. Moreover, two selection problems are addressed simultaneously: instance selection, which identifies the instances in each bag responsible for the bag response, and variable selection, which searches for the important covariates. To explore the joint discrete space of indicator variables created for selecting both explanatory variables and instances, the shotgun stochastic search algorithm is modified to fit the MIL context. The proposed model also offers a natural and rigorous way to quantify uncertainty in coefficient estimation and outcome prediction, which many modern MIL applications call for. The simulation study shows that the proposed regression model selects variables and instances with high performance (AUC greater than 0.86) and therefore predicts responses well. The proposed method is applied to the musk data to predict binding strengths (labels) between molecules (bags) with different conformations (instances) and target receptors. It outperforms all existing methods and can identify variables relevant in modeling responses.
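
For readers unfamiliar with shotgun stochastic search, the following is a minimal sketch of the generic idea for variable selection only. It is not the authors' modified algorithm: it assumes an ordinary linear regression rather than the hierarchical MIL model, uses a BIC-based score as a stand-in for a marginal posterior, and omits the instance-selection indicators. All function names (`bic_score`, `neighbours`, `shotgun_search`) and the simulated data are hypothetical and introduced only for illustration.

```python
import numpy as np

def bic_score(X, y, gamma):
    """Return -BIC/2 of the linear model using columns where gamma is True.
    Assumption: BIC stands in for a log marginal posterior (higher is better)."""
    n = len(y)
    cols = np.flatnonzero(gamma)
    design = np.column_stack([np.ones(n), X[:, cols]])  # intercept + selected columns
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = max(float(np.sum((y - design @ beta) ** 2)), 1e-12)
    return -0.5 * (n * np.log(rss / n) + design.shape[1] * np.log(n))

def neighbours(gamma):
    """All models one add, one delete, or one swap away from gamma."""
    out = []
    for j in range(len(gamma)):              # add or delete variable j
        g = gamma.copy()
        g[j] = not g[j]
        out.append(g)
    for j in np.flatnonzero(gamma):          # swap included j for excluded k
        for k in np.flatnonzero(~gamma):
            g = gamma.copy()
            g[j], g[k] = False, True
            out.append(g)
    return out

def shotgun_search(X, y, n_iter=30, seed=0):
    """Simplified search: score the whole neighbourhood, remember the best
    model seen, then move to a neighbour sampled in proportion to exp(score)."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(X.shape[1], dtype=bool)  # start from the empty model
    best, best_score = gamma.copy(), bic_score(X, y, gamma)
    for _ in range(n_iter):
        cand = neighbours(gamma)
        scores = np.array([bic_score(X, y, g) for g in cand])
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best, best_score = cand[i].copy(), float(scores[i])
        w = np.exp(scores - scores.max())     # stabilised sampling weights
        gamma = cand[rng.choice(len(cand), p=w / w.sum())]
    return best, best_score

# Simulated data (an assumption for the demo): variables 0 and 2 carry signal.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(100, 6))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng_demo.normal(scale=0.5, size=100)
best_demo, best_score_demo = shotgun_search(X_demo, y_demo)
print(np.flatnonzero(best_demo))
```

Scoring the full neighbourhood at every step is what distinguishes this family of searches from a single-proposal sampler: each evaluated neighbour can update the running best model even if the chain does not move there. In the MIL setting described in the abstract, the indicator vector would additionally need to cover instance selection within each bag, which this sketch does not attempt.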

Keywords: MCMC; Multiple instance learning; binding affinity prediction; hierarchical model; model selection; musk data.

Grants and funding

R01 CA258584/CA/NCI NIH HHS/United States