Optimization of Imbalanced and Multidimensional Learning Under Bayes Minimum Risk and Savings Measure

Big Data. 2022 Oct;10(5):425-439. doi: 10.1089/big.2021.0225. Epub 2022 Jun 20.

Abstract

The full potential of data analysis is limited by imbalanced and high-dimensional data, which makes these topics particularly important. Consequently, substantial research effort has been directed toward dimensionality reduction and the resolution of class imbalance, especially in the context of fraud detection. This work investigates the effectiveness of hybrid learning methods that alleviate class imbalance while integrating dimensionality reduction techniques. In this regard, the current study examines different classification combinations to achieve optimal savings and improve classification performance. Against this background, several well-known machine learning models are selected: logistic regression, random forest, CatBoost (CB), and XGBoost. These models are constructed and optimized with Bayes minimum risk (BMR), combined with the synthetic minority oversampling technique (SMOTE) and different feature selection (FS) techniques, both univariate and multivariate. To assess the proposed approach, several scenarios are analyzed: with and without balancing, with and without FS, and with and without BMR optimization. The main insight regarding the best method is that BMR provides good optimization when used with SMOTE, symmetrical uncertainty for FS, and CB as a boosted classifier, principally in terms of the F1 score and savings metrics.
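
The following is a minimal sketch of the kind of pipeline the abstract describes (FS, SMOTE, a boosted classifier, and BMR thresholding), not the authors' exact implementation. It assumes a synthetic imbalanced binary dataset, uses mutual information as a stand-in for symmetrical uncertainty (which scikit-learn does not provide directly), assumes illustrative misclassification costs C_FP and C_FN, and computes a simplified savings measure relative to classifying every case as non-fraud.

# Hedged sketch: SMOTE + univariate FS + CatBoost + Bayes minimum risk threshold.
# All costs, parameter values, and the FS proxy below are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from catboost import CatBoostClassifier

# Synthetic imbalanced, high-dimensional stand-in for a fraud dataset.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Univariate feature selection (mutual information used here as a proxy
#    for symmetrical uncertainty).
selector = SelectKBest(mutual_info_classif, k=15).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# 2) Rebalance the training set with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) Train a boosted classifier (CatBoost) and get fraud probabilities.
clf = CatBoostClassifier(iterations=300, verbose=False).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te_sel)[:, 1]

# 4) Bayes minimum risk: flag a case as fraud when the expected cost of a
#    false negative exceeds that of a false positive, i.e. p > C_FP / (C_FP + C_FN).
C_FP, C_FN = 1.0, 10.0                      # assumed asymmetric costs
threshold = C_FP / (C_FP + C_FN)
y_pred = (proba >= threshold).astype(int)

print("F1 score:", f1_score(y_te, y_pred))

# Simplified savings: 1 - cost(model) / cost(predicting all cases as non-fraud).
cost_model = C_FP * ((y_pred == 1) & (y_te == 0)).sum() \
           + C_FN * ((y_pred == 0) & (y_te == 1)).sum()
cost_base = C_FN * (y_te == 1).sum()
print("Savings:", 1 - cost_model / cost_base)

In this sketch, lowering the BMR threshold below 0.5 (because C_FN > C_FP) trades additional false positives for fewer costly missed frauds, which is the mechanism by which BMR can improve the savings metric.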

Keywords: Bayes minimum risk; SMOTE; ensemble learning; feature selection; fraud detection; high-dimensional data analysis; multimodal data.

MeSH terms

  • Bayes Theorem
  • Data Analysis*
  • Income
  • Machine Learning*