EPX: An R package for the ensemble of subsets of variables for highly unbalanced binary classification

Comput Biol Med. 2021 Sep:136:104760. doi: 10.1016/j.compbiomed.2021.104760. Epub 2021 Aug 13.

Abstract

Background and objective: In binary classification problems with a rare class of interest, relatively little information is available about the rare class for building a model. At the same time, the number of potentially useful explanatory variables can be very large. For example, in drug discovery, a large chemical library usually contains very few bioactive compounds, whereas thousands of potentially useful explanatory variables characterize a compound's chemical structure. This sparsity of information for the rare class makes it difficult for standard classification models to exploit the richness of the feature variables. The objective of this paper is therefore to develop an R package that clusters the feature variables into diverse subsets, which are then aggregated into a powerful ensemble for detecting rare-class objects.

Methods: The ensemble of phalanxes (EPX) builds a classifier by exploiting the richness of feature variables using several diverse subsets of variables, called phalanxes, and outperforms many competitive state-of-the-art classification methods in terms of predictive ranking of the rare class of interest.
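The aggregation and ranking idea can be illustrated with a minimal sketch. It is not the EPX package's API (EPX is an R package, and its functions are not shown here). It assumes that each phalanx has already produced a predicted probability of the rare class for every observation, and that the ensemble combines them by simple averaging. The function names, the probabilities, and the labels below are hypothetical and chosen only for illustration.

```python
from statistics import mean


def ensemble_scores(phalanx_probs):
    """Average the per-phalanx probabilities for each observation.

    phalanx_probs: one list of rare-class probabilities per phalanx,
    all of the same length (one entry per observation).
    Averaging is an assumed aggregation rule for this sketch.
    """
    return [mean(obs) for obs in zip(*phalanx_probs)]


def rank_by_score(scores):
    """Return observation indices ordered from most to least likely rare class."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical rare-class probabilities from three phalanxes for five observations.
phalanx_probs = [
    [0.10, 0.80, 0.30, 0.05, 0.60],
    [0.20, 0.70, 0.10, 0.10, 0.90],
    [0.15, 0.90, 0.20, 0.00, 0.30],
]
# Hypothetical true labels: 1 marks the rare class of interest.
labels = [0, 1, 0, 0, 1]

scores = ensemble_scores(phalanx_probs)
ranking = rank_by_score(scores)

# Count how many rare-class observations appear among the two top-ranked ones.
top2_hits = sum(labels[i] for i in ranking[:2])
print(ranking, top2_hits)
```

In this toy setting, the quality of the ensemble is judged by how early the rare-class observations appear in the ranking, which is the sense of "predictive ranking" used above.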

Results: We present the R package EPX, which implements the algorithm for forming the ensemble of phalanxes along with its associated functions. We further show how the ensemble of phalanxes can be constructed using parallel computing to lower the computational burden for high-dimensional data.

Conclusions: The R package EPX offers a flexible way of clustering the feature variable space into smaller, diverse subsets of variables to develop an ensemble of phalanxes that better ranks rare-class objects in a highly unbalanced two-class classification problem. The EPX ensemble will be useful for detecting rare drug-like active biomolecules for development in drug discovery (Tomal et al., Mar. 2016) [1] and homologous proteins using similarity scores of amino acid sequences in protein homology (Tomal et al., 2019) [2]. The package EPX is freely available to download from CRAN (https://CRAN.R-project.org/package=EPX).

Keywords: Drug discovery; EPX; Ensemble learning; Machine learning; Protein homology; R package.

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Cluster Analysis