A self-training algorithm based on the two-stage data editing method with mass-based dissimilarity

Jikui Wang; Yiwen Wu; Shaobo Li; Feiping Nie

doi:10.1016/j.neunet.2023.09.046

Neural Netw. 2023 Nov:168:431-449. doi: 10.1016/j.neunet.2023.09.046. Epub 2023 Sep 29.

Authors

Jikui Wang¹, Yiwen Wu², Shaobo Li³, Feiping Nie⁴

Affiliations

¹ School of Information Engineering and Artificial Intelligence, Lanzhou University of Finance and Economics, Lanzhou 730020, Gansu, China. Electronic address: wjkweb@163.com.
² School of Information Engineering and Artificial Intelligence, Lanzhou University of Finance and Economics, Lanzhou 730020, Gansu, China. Electronic address: 2516482760@qq.com.
³ State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, Guizhou, China. Electronic address: lishaobo@gzu.edu.cn.
⁴ School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China. Electronic address: feipingnie@gmail.com.

PMID: 37804746
DOI: 10.1016/j.neunet.2023.09.046

Abstract

A self-training algorithm is a classical semi-supervised learning algorithm that uses a small number of labeled samples and a large number of unlabeled samples to train a classifier. When calculating the similarity between samples, however, existing self-training algorithms consider only the geometric distance and ignore the data distribution. In addition, misclassified samples can severely degrade the performance of a self-training algorithm. To address these two problems, this paper proposes a self-training algorithm based on data editing with mass-based dissimilarity (STDEMB). First, the mass matrix is computed using the mass-based dissimilarity, and the mass-based local density of each sample is then determined from its k nearest neighbors. Inspired by density peak clustering (DPC), this study designs a prototype tree based on the prototype concept. In addition, an efficient two-stage data editing algorithm is developed to edit misclassified samples and select high-confidence samples during the self-training process. The proposed STDEMB algorithm is evaluated experimentally using accuracy and F-score as metrics. The experimental results on 18 benchmark datasets demonstrate the effectiveness of the proposed STDEMB algorithm.
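
For orientation, the generic self-training loop that the abstract takes as its starting point can be sketched as below. This is a minimal illustration and not STDEMB: it measures similarity with plain Euclidean distance to class centroids (the purely geometric view the abstract criticizes) and contains no mass-based dissimilarity, prototype tree, or data editing step. The nearest-centroid classifier, the function names, and the `min_margin` confidence threshold are all assumptions chosen for illustration.

```python
import math


def euclidean(p, q):
    """Plain geometric distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroids(points, labels):
    """Return the mean point of each class as {label: [coords]}."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_with_margin(centers, p):
    """Return (label, margin), where margin is the distance gap between
    the closest and second-closest centroid (a hypothetical confidence score)."""
    ranked = sorted((euclidean(p, c), y) for y, c in centers.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled points and retrain.

    Returns the final centroids and the points that were never labeled.
    """
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        centers = centroids(labeled, labels)
        confident, rest = [], []
        for p in pool:
            y, margin = predict_with_margin(centers, p)
            (confident if margin >= min_margin else rest).append((p, y))
        if not confident:
            break  # nothing passes the confidence threshold any more
        for p, y in confident:
            labeled.append(p)
            labels.append(y)  # pseudo-label is trusted as-is (no data editing)
        pool = [p for p, _ in rest]
    return centroids(labeled, labels), pool


if __name__ == "__main__":
    centers, leftover = self_train(
        labeled=[(0.0, 0.0), (10.0, 10.0)],
        labels=[0, 1],
        unlabeled=[(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)],
    )
    print(centers, leftover)
```

In this sketch a pseudo-label, once accepted, is never revisited, so an early mistake propagates into later rounds; that is the failure mode the paper's data editing stage is meant to address.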

Keywords: Data editing; Mass-based dissimilarity; Relative node set; Self-training algorithm.

MeSH terms

Algorithms*
Benchmarking
Cluster Analysis
Supervised Machine Learning*