Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification

Jelmar Quist; Lawson Taylor; Johan Staaf; Anita Grigoriadis

doi:10.3390/cancers13050991

Cancers (Basel). 2021 Feb 27;13(5):991. doi: 10.3390/cancers13050991.

Authors

Jelmar Quist¹,²,³; Lawson Taylor¹,²; Johan Staaf⁴; Anita Grigoriadis¹,²,³

Affiliations

¹ Cancer Bioinformatics, Cancer Centre at Guy's Hospital, King's College London, London SE1 9RT, UK.
² School of Cancer and Pharmaceutical Sciences, King's College London, London SE1 1UL, UK.
³ Breast Cancer Now Research Unit, Cancer Centre at Guy's Hospital, King's College London, London SE1 9RT, UK.
⁴ Division of Oncology, Department of Clinical Sciences Lund, Lund University, Medicon Village, SE-223 81 Lund, Sweden.

Abstract

Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and the assessment of relative feature importance. The performance of the approach was evaluated through simulation studies and on machine learning datasets. The results showed minimal multicollinearity and limited overfitting. To further assess performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. The reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications.

Keywords: DNA damage repair; breast cancer; integrative analysis; machine learning; random forest.
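
The abstract does not spell out the permutation scheme, so the following is only an illustrative sketch of the general idea and not the authors' implementation. It assumes a common strategy: compute a tree-based importance score for each feature, then shuffle the outcome labels repeatedly to build a null distribution for that score, so that features of different types (continuous, binary, multi-level) are judged against their own null rather than against each other. To stay self-contained, it uses bagged one-split trees (stumps) with Gini decrease as a simplified stand-in for a random forest, and treats the categorical feature as an ordered integer code. All function names, the toy data, and the parameter values are hypothetical.

```python
import random

def gini(pos, n):
    """Gini impurity of n binary labels, pos of which equal 1."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_split_gain(xs, ys):
    """Largest Gini decrease achievable by one threshold split on a feature."""
    pairs = sorted(zip(xs, ys))
    n, total = len(pairs), sum(ys)
    parent, best, left_pos = gini(total, n), 0.0, 0
    for i in range(n - 1):
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # a threshold can only sit between distinct values
        n_left, n_right = i + 1, n - i - 1
        child = (n_left * gini(left_pos, n_left)
                 + n_right * gini(total - left_pos, n_right)) / n
        best = max(best, parent - child)
    return best

def importance(X, y, n_trees, rng):
    """Mean best-split Gini decrease per feature over bootstrap replicates."""
    n = len(y)
    scores = {name: 0.0 for name in X}
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        yb = [y[i] for i in idx]
        for name, col in X.items():
            scores[name] += best_split_gain([col[i] for i in idx], yb) / n_trees
    return scores

def permutation_p_values(X, y, n_trees=30, n_perm=49, seed=0):
    """Compare observed importance with a label-permutation null distribution."""
    rng = random.Random(seed)
    observed = importance(X, y, n_trees, rng)
    exceed = {name: 0 for name in X}
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any feature-outcome association
        null = importance(X, y_perm, n_trees, rng)
        for name in X:
            exceed[name] += null[name] >= observed[name]
    p_values = {name: (1 + exceed[name]) / (1 + n_perm) for name in X}
    return observed, p_values

# Hypothetical toy data: two informative features and two noise features.
rng = random.Random(42)
n = 150
X = {
    "continuous": [rng.gauss(0, 1) for _ in range(n)],
    "binary": [rng.randrange(2) for _ in range(n)],
    "categorical_noise": [rng.randrange(8) for _ in range(n)],
    "continuous_noise": [rng.gauss(0, 1) for _ in range(n)],
}
y = [int(c + 1.5 * b - 0.75 + rng.gauss(0, 0.3) > 0)
     for c, b in zip(X["continuous"], X["binary"])]

observed, p_values = permutation_p_values(X, y)
for name in X:
    print(f"{name:18s} importance={observed[name]:.3f}  p={p_values[name]:.3f}")
```

Raw impurity-based scores tend to favour features with many possible split points, which is one reason a per-feature permutation null can be useful when data types are mixed: each feature is compared with what its own score looks like when no association exists.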
