Grab'Em: A Novel Graph-Based Method for Combining Feature Subset Selectors

Aida de Haro-Garcia; Jose Perez-Parras Toledano; Gonzalo Cerruela-Garcia; Nicolas Garcia-Pedrajas

doi:10.1109/TCYB.2020.3018815

IEEE Trans Cybern. 2022 May;52(5):2942-2954. doi: 10.1109/TCYB.2020.3018815. Epub 2022 May 19.

PMID: 33027013

Abstract

Feature selection is one of the most frequent tasks in data mining applications. Its ability to remove useless and redundant features, which improves classification performance and yields knowledge about a given problem, makes feature selection a common first step in data mining. In many feature selection applications, we need to combine the results of different feature selection processes. The two most common scenarios are ensembles of feature selectors and the scaling up of feature selection methods using a data division approach. The standard procedure is to store the number of times every feature has been selected as a vote for that feature and then evaluate different selection thresholds with a certain criterion to obtain the final subset of selected features. However, this method is suboptimal because the relationships among the features are not considered in the voting process. Two redundant features may be selected a similar number of times because different sets of instances are used each time. Thus, a voting scheme would tend to select both of them. In this article, we present a new approach: instead of using only the number of times a feature has been selected, the approach considers how many times the features have been selected together by a feature selection algorithm. The proposal is based on constructing an undirected graph where the vertices are the features and the edges count the number of times every pair of features has been selected together. This graph is used to select the best subset of features, avoiding the redundancy introduced by the voting scheme. The proposal improves the results of the standard voting scheme in both ensembles of feature selectors and data division methods for scaling up feature selection.
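
The contrast between per-feature votes and pairwise co-selection counts can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`vote_counts`, `co_selection_graph`, `select_by_threshold`) and the toy data are hypothetical, and the abstract does not specify how the graph is then used to choose the final subset, so that step is deliberately left out.

```python
from collections import Counter
from itertools import combinations


def vote_counts(selections):
    """Count how many times each feature was selected (standard voting)."""
    votes = Counter()
    for subset in selections:
        votes.update(set(subset))
    return votes


def co_selection_graph(selections):
    """Undirected graph stored as edge weights: the weight of (u, v), with
    u < v, is the number of runs in which both features were selected."""
    edges = Counter()
    for subset in selections:
        for u, v in combinations(sorted(set(subset)), 2):
            edges[(u, v)] += 1
    return edges


def select_by_threshold(votes, threshold):
    """Voting baseline: keep every feature with at least `threshold` votes."""
    return sorted(f for f, count in votes.items() if count >= threshold)


# Hypothetical toy data: four selector runs. Features "a" and "b" are assumed
# to be redundant, so each run picks one of them alongside "c".
selections = [
    {"a", "c"},
    {"b", "c"},
    {"a", "c"},
    {"b", "c"},
]

votes = vote_counts(selections)
graph = co_selection_graph(selections)

print(dict(votes))
print(dict(graph))
print(select_by_threshold(votes, 2))
```

In this toy case "a" and "b" receive the same number of votes, so a vote threshold of 2 keeps both, whereas the graph contains no edge between them because they were never selected together. That missing edge is the kind of redundancy signal the vote counts alone cannot express.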

MeSH terms

Algorithms*
Data Mining*
Research Design