Optimization of Imbalanced and Multidimensional Learning Under Bayes Minimum Risk and Savings Measure

Big Data. 2022 Oct;10(5):425-439. doi: 10.1089/big.2021.0225. Epub 2022 Jun 20.

Abstract

The full potential of data analysis is limited by imbalanced and high-dimensional data, which makes these topics particularly important. Consequently, substantial research effort has been directed toward dimensionality reduction and the resolution of class imbalance, especially in the context of fraud detection. This work investigates the effectiveness of hybrid learning methods that alleviate class imbalance while integrating dimensionality reduction techniques. In this regard, the current study examines different classification combinations to achieve optimal savings and improve classification performance. Against this background, several well-known machine learning models are selected: logistic regression, random forest, CatBoost (CB), and XGBoost. These models are constructed and optimized with Bayes minimum risk (BMR), combined with the synthetic minority oversampling technique (SMOTE) and different feature selection (FS) techniques, both univariate and multivariate. To assess the proposed approach, several scenarios are analyzed: with and without balancing, with and without FS, and with and without BMR optimization. The main insight regarding the best method is that BMR provides good optimization when used with SMOTE, symmetrical uncertainty for FS, and CB as a boosted classifier, principally in terms of the F1 score and savings metrics.
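
The following is a minimal sketch of the kind of pipeline the abstract describes (FS, SMOTE, a boosted classifier, and BMR thresholding), not the authors' exact implementation. It assumes a synthetic imbalanced binary dataset, uses mutual information as a stand-in for symmetrical uncertainty (which scikit-learn does not provide directly), assumes illustrative misclassification costs C_FP and C_FN, and computes a simplified savings measure relative to classifying every case as non-fraud.

# Hedged sketch: SMOTE + univariate FS + CatBoost + Bayes minimum risk threshold.
# All costs, parameter values, and the FS proxy below are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from catboost import CatBoostClassifier

# Synthetic imbalanced, high-dimensional stand-in for a fraud dataset.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Univariate feature selection (mutual information used here as a proxy
#    for symmetrical uncertainty).
selector = SelectKBest(mutual_info_classif, k=15).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# 2) Rebalance the training set with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) Train a boosted classifier (CatBoost) and get fraud probabilities.
clf = CatBoostClassifier(iterations=300, verbose=False).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te_sel)[:, 1]

# 4) Bayes minimum risk: flag a case as fraud when the expected cost of a
#    false negative exceeds that of a false positive, i.e. p > C_FP / (C_FP + C_FN).
C_FP, C_FN = 1.0, 10.0                      # assumed asymmetric costs
threshold = C_FP / (C_FP + C_FN)
y_pred = (proba >= threshold).astype(int)

print("F1 score:", f1_score(y_te, y_pred))

# Simplified savings: 1 - cost(model) / cost(predicting all cases as non-fraud).
cost_model = C_FP * ((y_pred == 1) & (y_te == 0)).sum() \
           + C_FN * ((y_pred == 0) & (y_te == 1)).sum()
cost_base = C_FN * (y_te == 1).sum()
print("Savings:", 1 - cost_model / cost_base)

In this sketch, lowering the BMR threshold below 0.5 (because C_FN > C_FP) trades additional false positives for fewer costly missed frauds, which is the mechanism by which BMR can improve the savings metric.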

Keywords: Bayes minimum risk; SMOTE; ensemble learning; feature selection; fraud detection; high-dimensional data analysis; multimodal data.

MeSH terms

  • Bayes Theorem
  • Data Analysis*
  • Income
  • Machine Learning*