Random forests with parametric entropy-based information gains for classification and regression problems

PeerJ Comput Sci. 2024 Jan 3;10:e1775. doi: 10.7717/peerj-cs.1775. eCollection 2024.

Abstract

The random forest algorithm is one of the most popular algorithms for classification and regression tasks. It combines the outputs of multiple decision trees into a single result. In various applications, random forests achieve the highest accuracy on tabular data compared to other algorithms. However, random forests and, more precisely, their decision trees are usually built using the classical Shannon entropy. In this article, we consider the potential of deformed entropies, which are successfully used in the field of complex systems, to increase the prediction accuracy of random forest algorithms. We develop and introduce information gains based on Renyi, Tsallis, and Sharma-Mittal entropies for classification and regression random forests. We test the proposed algorithm modifications on six benchmark datasets: three for classification and three for regression problems. For classification problems, Renyi entropy improves the random forest prediction accuracy by 19-96% compared to the classical algorithm, depending on the dataset; Tsallis entropy improves it by 20-98%, and Sharma-Mittal entropy by 22-111%. For regression problems, deformed entropies improve the prediction by 2-23% in terms of R², depending on the dataset.
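The abstract does not state the formulas, so the following is only a minimal sketch of how an entropy-based information gain for a classification split could be parameterized. It assumes the standard textbook definitions of the Renyi, Tsallis, and Sharma-Mittal entropies (natural logarithm) and the usual "parent entropy minus weighted child entropy" gain; the function names are hypothetical, and the article's exact gain definitions, parameter values, and regression variant may differ.

```python
import math
from collections import Counter


def class_probs(labels):
    """Empirical class probabilities of a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def shannon(p):
    """Classical Shannon entropy: -sum(p * ln p)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def renyi(p, alpha):
    """Renyi entropy: ln(sum(p^alpha)) / (1 - alpha); Shannon as alpha -> 1."""
    if abs(alpha - 1.0) < 1e-12:
        return shannon(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


def tsallis(p, q):
    """Tsallis entropy: (1 - sum(p^q)) / (q - 1); Shannon as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def sharma_mittal(p, alpha, beta):
    """Sharma-Mittal entropy: ((sum(p^alpha))^((1-beta)/(1-alpha)) - 1) / (1 - beta).

    Assumes alpha != 1. Reduces to Renyi as beta -> 1 and to Tsallis when beta == alpha.
    """
    if abs(alpha - 1.0) < 1e-12:
        raise ValueError("this sketch assumes alpha != 1")
    if abs(beta - 1.0) < 1e-12:
        return renyi(p, alpha)
    s = sum(x ** alpha for x in p if x > 0)
    return (s ** ((1.0 - beta) / (1.0 - alpha)) - 1.0) / (1.0 - beta)


def information_gain(parent, children, entropy):
    """Parent entropy minus the size-weighted entropy of the child nodes.

    `entropy` is any function mapping a probability list to a number.
    """
    n = len(parent)
    weighted = sum(
        len(child) / n * entropy(class_probs(child)) for child in children if child
    )
    return entropy(class_probs(parent)) - weighted


# Usage: score one candidate split of a small two-class node.
parent = [0, 0, 1, 1]
split = [[0, 0], [1, 1]]
gain_shannon = information_gain(parent, split, shannon)
gain_renyi = information_gain(parent, split, lambda p: renyi(p, 2.0))
gain_tsallis = information_gain(parent, split, lambda p: tsallis(p, 2.0))
gain_sm = information_gain(parent, split, lambda p: sharma_mittal(p, 2.0, 2.0))
```

In a tree learner, a gain of this kind would be evaluated for every candidate split and the split with the largest value kept; the deformation parameters (here `alpha`, `q`, `beta`) would then be treated as additional hyperparameters to tune.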

Keywords: Classification; Random forest; Regression; Renyi entropy; Sharma-Mittal entropy; Tsallis entropy.

Grants and funding

This work was supported by the Basic Research Program at the National Research University Higher School of Economics in 2023 (project “Innovative methods of data collection and analysis in the modeling of communicative behavior of Internet users and the development of respective technological solutions”). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.