A note on the interpretation of tree-based regression models

Biom J. 2020 Oct;62(6):1564-1573. doi: 10.1002/bimj.201900195. Epub 2020 May 25.

Abstract

Tree-based models are a popular tool for predicting a response from a set of explanatory variables when the regression function is complex. They are sometimes also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, the typical variable importance measures mainly flag as important the background variables, which have a strong indirect effect, and disregard the variables that directly influence the response. This is mainly attributable to the choice of splitting variable in the first steps of the algorithm and to the greedy nature of that search. This pitfall can be relevant when tree-based algorithms are used to understand the underlying generating process, for population segmentation, and for causal inference.
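The mechanism described above can be illustrated with a small simulation. The sketch below is not the authors' experiment; it assumes a hypothetical generating model in which a background variable Z affects the response Y only through three intermediate variables X1, X2, X3, all with unit Gaussian noise. It then computes, for each variable, the largest reduction in the sum of squared errors achievable by a single threshold split at the root, which is the quantity a greedy regression tree compares when choosing its first splitting variable. The function names, sample size, and coefficients are illustrative assumptions.

```python
import random


def simulate(n=2000, n_mediators=3, seed=0):
    """Hypothetical chain Z -> X1..Xk -> Y; Z has no direct effect on Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)                                   # background variable
        xs = [z + rng.gauss(0, 1) for _ in range(n_mediators)]  # direct causes of Y
        y = sum(xs) + rng.gauss(0, 1)                         # response
        rows.append(([z] + xs, y))
    return rows


def best_split_gain(values, ys):
    """Largest reduction in the sum of squared errors from one threshold split."""
    pairs = sorted(zip(values, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    base = total * total / n
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot place a threshold between tied values
        right_sum = total - left_sum
        # SSE reduction = L^2/n_L + R^2/n_R - T^2/n
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - base
        best = max(best, gain)
    return best


def root_split_gains(rows):
    """Best root-split gain for each candidate variable."""
    n_features = len(rows[0][0])
    names = ["Z"] + [f"X{j}" for j in range(1, n_features)]
    ys = [y for _, y in rows]
    return {
        name: best_split_gain([x[j] for x, _ in rows], ys)
        for j, name in enumerate(names)
    }


gains = root_split_gains(simulate())
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.1f}")
```

Under these assumptions Z is more strongly correlated with Y than any single X is, because it acts through all three paths at once, so a greedy search may pick Z for the first split even though Z has no direct effect on Y. The ranking depends on the assumed coefficients, noise levels, and number of intermediate variables, so it should be read as an illustration rather than a general result.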

Keywords: interpretable machine learning; marginal and conditional dependence; underlying explanatory process; variable importance; variable selection bias.

MeSH terms

  • Algorithms*
  • Models, Statistical*
  • Regression Analysis*