LAVASET: Latent Variable Stochastic Ensemble of Trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies

Melpomeni Kasapi; Kexin Xu; Timothy M D Ebbels; Declan P O'Regan; James S Ware; Joram M Posma

doi:10.1093/bioinformatics/btae101

Bioinformatics. 2024 Mar 4;40(3):btae101. doi: 10.1093/bioinformatics/btae101.

Authors

Melpomeni Kasapi¹,²,³, Kexin Xu¹, Timothy M D Ebbels¹, Declan P O'Regan²,³, James S Ware²,³,⁴,⁵, Joram M Posma¹

Affiliations

¹ Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London W12 0NN, United Kingdom.
² Faculty of Medicine, National Heart & Lung Institute, Imperial College London, London W12 0NN, United Kingdom.
³ MRC London Institute of Medical Sciences, Imperial College London, London W12 0HS, United Kingdom.
⁴ Royal Brompton & Harefield Hospitals, Guy's and St. Thomas' NHS Foundation Trust, London SW3 6NP, United Kingdom.
⁵ Program in Medical & Population Genetics, Broad Institute of MIT & Harvard, Cambridge, MA 02142, United States.

PMID: 38383048
DOI: 10.1093/bioinformatics/btae101

Abstract

Motivation: Random forests (RFs) can deal with a large number of variables, achieve reasonable prediction scores, and yield highly interpretable feature importance values. As such, RFs are appropriate models for feature selection and further dimension reduction. However, RFs are often not appropriate for correlated datasets due to their mode of selecting individual features for splitting. Addressing correlation relationships in high-dimensional datasets is imperative for reducing the number of variables that are assigned high importance, hence making the dimension reduction most efficient. Here, we propose the LAtent VAriable Stochastic Ensemble of Trees (LAVASET) method that derives latent variables based on the distance characteristics of each feature and aims to incorporate the correlation factor in the splitting step.
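The idea of a neighbourhood-derived latent variable can be illustrated with a small sketch. The abstract does not state how the latent variable is computed, so this example assumes, purely for illustration, that it is the first principal component of a feature together with its adjacent columns in a 1D (e.g. spectral) dataset. The function name `first_latent_variable`, the parameter `k`, and the simulated data are all hypothetical and are not taken from the LAVASET package.

```python
import math
import random


def first_latent_variable(X, feature, k=1, n_iter=200):
    """First principal component of a feature's neighbourhood (illustrative only).

    X is a list of rows. The neighbourhood is the feature plus up to k adjacent
    columns on each side, an assumed stand-in for a distance-based neighbourhood.
    Returns (scores, loadings, neighbour_indices).
    """
    n, n_features = len(X), len(X[0])
    idx = list(range(max(0, feature - k), min(n_features, feature + k + 1)))
    p = len(idx)

    # Centre each neighbourhood column.
    means = [sum(row[j] for row in X) / n for j in idx]
    block = [[row[j] - m for j, m in zip(idx, means)] for row in X]

    # Sample covariance matrix of the centred block.
    cov = [[sum(r[a] * r[b] for r in block) / (n - 1) for b in range(p)]
           for a in range(p)]

    # Power iteration for the leading eigenvector (the PC1 loadings).
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        v = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]

    # Project each sample onto the loadings to get one latent value per sample.
    scores = [sum(r[a] * w[a] for a in range(p)) for r in block]
    return scores, w, idx


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


# Simulated data: columns 3-5 share one underlying signal, the rest are noise.
random.seed(0)
n_samples = 200
signal = [random.gauss(0, 1) for _ in range(n_samples)]
X = [[random.gauss(0, 0.3) + (s if 3 <= j <= 5 else 0.0) for j in range(10)]
     for s in signal]

scores, loadings, idx = first_latent_variable(X, feature=4, k=1)
```

Here a tree could split on `scores` rather than on column 4 alone, and the `loadings` indicate how strongly each neighbouring column contributes to the latent variable. In this simulation, `pearson(scores, signal)` is close to 1 in absolute value, because the three correlated columns jointly recover the shared signal.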

Results: Without compromising on performance in the majority of examples, LAVASET outperforms RF by accurately determining feature importance across all correlated variables and ensuring proper distribution of importance values. LAVASET yields mostly non-inferior prediction accuracies to traditional RFs when tested in simulated and real 1D datasets, as well as more complex and high-dimensional 3D datatypes. Unlike traditional RFs, LAVASET is unaffected by single 'important' noisy features (false positives), as it considers the local neighbourhood. LAVASET, therefore, highlights neighbourhoods of features, reflecting real signals that collectively impact the model's predictive ability.

Availability and implementation: LAVASET is freely available as a standalone package from https://github.com/melkasapi/LAVASET.

Grants and funding