Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation

Georgios Stavropoulos; Robert van Vorstenbosch; Daisy M A E Jonkers; John Penders; Jane E Hill; Frederik-Jan van Schooten; Agnieszka Smolinska

doi:10.1016/j.aca.2021.339001

Anal Chim Acta. 2021 Oct 23:1183:339001. doi: 10.1016/j.aca.2021.339001. Epub 2021 Aug 28.

Authors

Georgios Stavropoulos¹, Robert van Vorstenbosch¹, Daisy M A E Jonkers², John Penders³, Jane E Hill⁴, Frederik-Jan van Schooten¹, Agnieszka Smolinska⁵

Affiliations

¹ Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands.
² Division of Gastroenterology and Hepatology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands.
³ Department of Medical Microbiology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands.
⁴ Department of Chemical and Biological Engineering, School of Biomedical Engineering, The University of British Columbia, Vancouver, Canada.
⁵ Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands. Electronic address: a.smolinska@maastrichtuniversity.nl.

PMID: 34627524
DOI: 10.1016/j.aca.2021.339001

Abstract

Data fusion has gained much attention in the life sciences because the analysis of biological samples may require data from multiple complementary sources to characterise the samples fully. Data fusion rests on the idea that different data platforms detect different biological entities; combining these entities can therefore provide comprehensive profiling and a better understanding of the research question at hand. Data fusion is traditionally performed in three ways: low-level, mid-level, and high-level fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. To that end, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, and individual platform classification predictions were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples and then projecting the test samples onto that space. The implementation of pseudo-samples made it possible to identify the most important variables per platform, to find relations among variables of the different data platforms, and to examine how variables behave in the samples.
The proximities stacking approach outperforms both the mid-level and high-level fusion approaches, as well as all individual platform predictions. It also tackles significant bottlenecks of the traditional fusion approaches and of another advanced fusion approach discussed in the paper. Finally, it contradicts the general belief that more data necessarily lead to a better result; careful consideration is therefore needed before any data fusion analysis is conducted.
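
As an illustration only, the two quantities named in the abstract (random forest proximities, and sensitivity and specificity on a test set) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the leaf indices are assumed to have been obtained beforehand from a fitted forest (in scikit-learn, for example, the `apply` method of a fitted `RandomForestClassifier` returns them), and "stacking" is assumed here to mean placing the per-platform proximity matrices side by side, which the abstract does not spell out.

```python
from typing import List, Sequence, Tuple


def proximity_matrix(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Random forest proximity between every pair of samples.

    leaves[i][t] is the leaf that sample i reaches in tree t.
    Proximity = fraction of trees in which two samples share a leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def stack_proximities(blocks: Sequence[List[List[float]]]) -> List[List[float]]:
    """ASSUMPTION: 'stacking' = column-wise concatenation of the
    per-platform proximity matrices, one row per sample."""
    n = len(blocks[0])
    return [[v for block in blocks for v in block[i]] for i in range(n)]


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy leaf indices for 3 samples and 3 trees on two hypothetical platforms.
    platform_a = proximity_matrix([[1, 2, 3], [1, 2, 4], [5, 6, 7]])
    platform_b = proximity_matrix([[0, 0, 1], [2, 0, 1], [2, 3, 1]])
    fused = stack_proximities([platform_a, platform_b])
    print(len(fused), len(fused[0]))
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

In such a setup the stacked matrix would serve as the input to a final classifier, and sensitivity and specificity would be computed on the held-out samples only (26 of the 130 in the study).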

Keywords: Classification; Crohn's disease; Data fusion; Proximities; Stacking; Variable behaviour.

MeSH terms

Biological Science Disciplines*
Data Interpretation, Statistical*
Data Management
Humans