Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation

Meng-Fong Tsai; Shyr-Shen Yu

doi:10.1007/s10916-016-0516-3

J Med Syst. 2016 Jul;40(7):159. doi: 10.1007/s10916-016-0516-3. Epub 2016 May 16.

Authors

Meng-Fong Tsai¹, Shyr-Shen Yu²

Affiliations

¹ Department of Computer Science and Engineering, National Chung Hsing University, Taichung, 402, Taiwan.
² Department of Computer Science and Engineering, National Chung Hsing University, Taichung, 402, Taiwan. pyu@nchu.edu.tw.

PMID: 27185255
DOI: 10.1007/s10916-016-0516-3

Abstract

Imbalanced classification arises when the classes in a dataset are unequally represented. On such datasets, most classification methods predict the majority class with high accuracy but the minority class with significantly lower accuracy. To overcome this problem, this study took several imbalanced datasets from the well-known UCI datasets and designed and implemented an efficient algorithm that couples Top-N Reverse k-Nearest Neighbor (TRkNN) with the Synthetic Minority Oversampling TEchnique (SMOTE). The proposed algorithm was evaluated with classification methods such as logistic regression (LR), C4.5, Support Vector Machine (SVM), and Back Propagation Neural Network (BPNN). The study also adopted different distance metrics when classifying the same UCI datasets. The empirical results indicate that the Euclidean and Manhattan distances not only yield higher accuracy but are also more computationally efficient than the Chebyshev and Cosine distances. Therefore, the proposed algorithm based on TRkNN and SMOTE can be widely used to handle imbalanced datasets. Our recommendations on choosing suitable distance metrics can also serve as a reference for future studies.
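
As an illustration of the ideas named in the abstract, the following is a minimal Python sketch of the four distance metrics and of plain SMOTE-style interpolation with a pluggable metric. It is not the authors' algorithm: the abstract does not describe how TRkNN is combined with SMOTE, so that step is omitted here. The function names, the parameters (`k`, `n_synthetic`, `seed`), and the convention of returning a cosine distance of 1.0 for a zero vector are all assumptions made for this example.

```python
import math
import random


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference (L-infinity)."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """Cosine distance, i.e. 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def smote(minority, n_synthetic, k=5, distance=euclidean, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours under the given distance function (plain SMOTE only;
    no TRkNN step is applied)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
    for metric in (euclidean, manhattan, chebyshev, cosine):
        print(metric.__name__, smote(minority, 2, k=2, distance=metric))
```

Swapping the `distance` argument changes only which neighbours are selected for interpolation, which is one way a choice of metric could influence an oversampling step of this kind.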

Keywords: Distance Metric; Imbalanced classification; Synthetic minority oversampling technique; UCI Dataset.

MeSH terms

Algorithms*
Cluster Analysis
Computational Biology / methods*
Data Accuracy*
Humans
Logistic Models
Neural Networks, Computer