Optimization of Skewed Data Using Sampling-Based Preprocessing Approach

Sushruta Mishra; Pradeep Kumar Mallick; Lambodar Jena; Gyoo-Soo Chae

doi:10.3389/fpubh.2020.00274

Front Public Health. 2020 Jul 16;8:274. doi: 10.3389/fpubh.2020.00274. eCollection 2020.

Authors

Sushruta Mishra¹, Pradeep Kumar Mallick¹, Lambodar Jena², Gyoo-Soo Chae³

Affiliations

¹ School of Computer Engineering, Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India.
² Department of Computer Science and Engineering, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, India.
³ Division of Information & Communication, Baekseok University, Cheonan-si, South Korea.

Abstract

Classification has evolved considerably over the past few years. As the volume of data gathered from different sources keeps growing, processing and analyzing it efficiently is becoming difficult. An uneven distribution of samples among classes makes classification with machine-learning techniques harder still: most algorithms focus on the majority-class samples and ignore the minority class. The data-skewing issue is therefore a critical problem that needs researchers' attention. This paper emphasizes data preprocessing with sampling techniques to overcome the data-skewing problem. Three sampling techniques (Resampling, SpreadSubSampling, and SMOTE) are applied to reduce the uneven data distribution, and the resulting data are classified with the K-nearest neighbor algorithm. Classification performance is evaluated with several metrics to determine the efficiency of classification.
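
As an illustration of the pipeline the abstract describes, the following minimal sketch oversamples a minority class in the style of SMOTE, classifies with K-nearest neighbors, and computes an F-score. It is not the authors' implementation: the function names (`smote`, `knn_predict`, `f_score`), the toy data, and the choice of k = 3 are all assumptions made for this example, and only the standard library is used.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling (illustrative): each synthetic point lies on
    the segment between a minority sample and one of its k nearest
    minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: sq_dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sq_dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def f_score(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical skewed data set: 20 majority points (class 0), 4 minority (class 1).
majority = [(float(i % 5), float(i // 5)) for i in range(20)]
minority = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority))
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
```

After balancing, `knn_predict(X, y, point)` classifies a new point and `f_score(y_true, y_pred)` summarizes performance on the minority class. An undersampling approach such as SpreadSubSampling would instead discard majority samples, which this sketch does not show.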

Keywords: F-score; KNN algorithm; SMOTE; SpreadSubSampling; best first search; data skewing problem; machine learning.

MeSH terms

Algorithms*
Cluster Analysis
Machine Learning*