Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Vitor Werner de Vargas; Jorge Arthur Schneider Aranda; Ricardo Dos Santos Costa; Paulo Ricardo da Silva Pereira; Jorge Luis Victória Barbosa

doi:10.1007/s10115-022-01772-8

Knowl Inf Syst. 2023;65(1):31-57. doi: 10.1007/s10115-022-01772-8. Epub 2022 Nov 9.

Authors

Vitor Werner de Vargas¹, Jorge Arthur Schneider Aranda¹, Ricardo Dos Santos Costa², Paulo Ricardo da Silva Pereira², Jorge Luis Victória Barbosa¹ ²

Affiliations

¹ Applied Computing Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil.
² Electrical Engineering Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil.

Abstract

Machine Learning (ML) algorithms have been increasingly replacing people in several application domains, most of which suffer from data imbalance. To address this problem, published studies apply data preprocessing techniques, cost-sensitive learning, and ensemble learning. These solutions reduce the naturally occurring bias of ML models towards the majority sample. This study uses a systematic mapping methodology to assess 9927 papers, retrieved from 7 digital libraries, on sampling techniques for ML in imbalanced data applications. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. Based on a thorough quantitative analysis of these papers, this study proposes two taxonomies, one illustrating sampling techniques and the other ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance, with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.
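As an illustration of the oversampling family mentioned in the abstract, the following is a minimal sketch of plain random oversampling, in which samples from the smaller classes are duplicated at random until every class matches the largest one. It is not taken from the paper: the function name `random_oversample`, the `seed` parameter, and the toy data are assumptions made only for this example, and the surveyed works may rely on quite different (for instance synthetic or hybrid) techniques.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen samples of each smaller
    class until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest (majority) class

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy data (assumed for illustration): 6 majority points, 2 minority points.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

After the call, both classes in `y_bal` have the same number of entries, and the original samples are preserved at the front of `X_bal`. Because the added points are exact duplicates, this simple variant adds no new information, which is one reason synthetic and hybrid sampling methods exist.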

Keywords: Imbalanced data; Machine learning; Preprocessing techniques; Sampling; Systematic mapping study.

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022, Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.