Distributed Selection of Continuous Features in Multilabel Classification Using Mutual Information

IEEE Trans Neural Netw Learn Syst. 2020 Jul;31(7):2280-2293. doi: 10.1109/TNNLS.2019.2944298. Epub 2019 Oct 21.

Abstract

Multilabel learning is a challenging task demanding scalable methods for large-scale data. Feature selection has been shown to improve multilabel accuracy while mitigating the curse of dimensionality in high-dimensional, scattered data. However, the growing complexity of multilabel feature selection, especially over continuous features, requires new approaches that manage data effectively and efficiently in distributed computing environments. This article proposes a distributed model on Apache Spark that adapts mutual information (MI) to continuous features and multiple labels. Two approaches are presented: MI maximization (MIM) and minimum-redundancy maximum-relevance (mRMR). The former selects the subset of features that maximizes the MI between the features and the labels, whereas the latter additionally minimizes the redundancy among the selected features. Experiments compare the distributed multilabel feature selection methods on 10 data sets and 12 metrics. Results, validated through statistical analysis, indicate that our methods outperform reference methods for distributed feature selection on multilabel data, while MIM also reduces the runtime by orders of magnitude.
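To make the two selection criteria concrete, the following is a minimal single-machine sketch in Python, not the authors' implementation. It assumes the continuous features have already been discretized (e.g., by binning) so that plain discrete MI estimates apply, and it uses scikit-learn's `mutual_info_score`; the function names `multilabel_mi`, `mim_select`, and `mrmr_select` are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def multilabel_mi(x, Y):
    """Relevance of one discretized feature: summed MI against every label column."""
    return sum(mutual_info_score(x, Y[:, l]) for l in range(Y.shape[1]))


def mim_select(X, Y, k):
    """MI maximization (MIM): rank features by relevance alone and keep the top k."""
    scores = [multilabel_mi(X[:, j], Y) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])


def mrmr_select(X, Y, k):
    """Greedy mRMR: maximize relevance minus average redundancy with selected features."""
    n = X.shape[1]
    relevance = [multilabel_mi(X[:, j], Y) for j in range(n)]
    selected = [int(np.argmax(relevance))]  # seed with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            # Redundancy: mean MI between candidate j and already-selected features
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy usage with random discretized data (5 feature bins, 4 binary labels)
X = np.random.randint(0, 5, size=(500, 50))
Y = np.random.randint(0, 2, size=(500, 4))
print(mim_select(X, Y, k=10))
print(mrmr_select(X, Y, k=10))
```

The extra redundancy term is why mRMR is costlier than MIM: each greedy step recomputes pairwise MI against the growing selected set, which is consistent with MIM being the faster of the two methods.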
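The distributed aspect can be illustrated by parallelizing the per-feature relevance computation with PySpark. This is only a sketch of the general pattern (score each feature column in parallel, collect the scores, rank), under the same discretization assumption as above; it is not the paper's actual Spark pipeline.

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.metrics import mutual_info_score

spark = SparkSession.builder.master("local[*]").appName("mim-sketch").getOrCreate()
sc = spark.sparkContext

# Toy discretized data; the label matrix is small enough to broadcast to workers.
X = np.random.randint(0, 5, size=(1000, 200))
Y = np.random.randint(0, 2, size=(1000, 4))
Y_bc = sc.broadcast(Y)


def score_feature(indexed_column):
    """Compute the multilabel relevance of one feature column on a worker."""
    j, col = indexed_column
    labels = Y_bc.value
    return j, sum(mutual_info_score(col, labels[:, l])
                  for l in range(labels.shape[1]))


# One RDD element per feature column; workers score columns independently.
scores = (sc.parallelize([(j, X[:, j]) for j in range(X.shape[1])], numSlices=8)
            .map(score_feature)
            .collect())

top_k = [j for j, _ in sorted(scores, key=lambda t: -t[1])[:10]]
print(top_k)
spark.stop()
```

Because each feature's MI score is independent of the others, MIM parallelizes with a single map-and-collect pass, which matches the runtime advantage reported in the abstract.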