SEQENS: An ensemble method for relevant gene identification in microarray data

Comput Biol Med. 2023 Jan:152:106413. doi: 10.1016/j.compbiomed.2022.106413. Epub 2022 Dec 6.

Abstract

This paper describes SEQENS, an ensemble feature identification algorithm, and measures its ability to identify the relevant variables in a case-control study using a gene expression microarray dataset. SEQENS applies Sequential Feature Search to multiple sample splits to select the variables most strongly related to the target, and finally produces a variable relevance ranking. Although designed for feature identification, SEQENS could also serve as a basis for feature selection (classifier optimisation). Cliff, a ranking evaluation metric, is also presented and used to assess feature identification algorithms when a ground truth of relevant variables is available. To test performance, three types of synthetic ground truths emulating fictitious diseases are generated from ten randomly chosen variables of the E-MTAB-3732 dataset, each following a different target pattern distribution. Several sample-to-dimensionality ratios are explored, with 300 to 3,000 observations and 854 to 54,675 variables. SEQENS is compared with other state-of-the-art feature selection or identification methods. On average, the proposed algorithm identifies the relevant genes better and exhibits greater stability. The algorithm is available to the community.
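
The general idea of running a sequential feature search on many sample splits and aggregating the results into a relevance ranking can be illustrated with a minimal sketch. This is not the SEQENS implementation: the nearest-centroid classifier, the random half/half splits, the stopping rule, ranking by selection frequency, and every function name below are assumptions chosen only to keep the example small and dependency-free.

```python
import random
from collections import Counter


def centroid_accuracy(X_train, y_train, X_val, y_val, features):
    """Validation accuracy of a nearest-centroid classifier using only `features`."""
    centroids = {}
    for label in sorted(set(y_train)):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda lab: sum((x[f] - c) ** 2 for f, c in zip(features, centroids[lab])),
        )

    hits = sum(predict(x) == y for x, y in zip(X_val, y_val))
    return hits / len(y_val)


def forward_search(X_train, y_train, X_val, y_val, max_features):
    """Greedy sequential forward search: add the feature that most improves accuracy."""
    selected, best_score = [], 0.0
    n_features = len(X_train[0])
    for _ in range(max_features):
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [
            (centroid_accuracy(X_train, y_train, X_val, y_val, selected + [f]), f)
            for f in candidates
        ]
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:  # assumed stopping rule: no improvement
            break
        selected.append(feature)
        best_score = score
    return selected


def ensemble_ranking(X, y, n_splits=20, max_features=3, seed=0):
    """Rank features by how often the search selects them across random splits."""
    rng = random.Random(seed)
    counts = Counter()
    indices = list(range(len(X)))
    for _ in range(n_splits):
        rng.shuffle(indices)
        half = len(indices) // 2
        train, val = indices[:half], indices[half:]
        counts.update(
            forward_search(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in val], [y[i] for i in val],
                max_features,
            )
        )
    order = sorted(range(len(X[0])), key=lambda f: (-counts[f], f))
    return [(f, counts[f] / n_splits) for f in order]


# Toy data (hypothetical): 60 samples, 6 noise features, feature 2 shifted by class.
rng = random.Random(42)
labels = [i % 2 for i in range(60)]
data = [[rng.gauss(0, 1) for _ in range(6)] for _ in labels]
for row, label in zip(data, labels):
    row[2] += 10 * label

ranking = ensemble_ranking(data, labels)
print(ranking)  # list of (feature index, selection frequency), most frequent first
```

On this toy data the only informative variable is feature 2, so it is expected to head the ranking. Aggregating over many splits, rather than trusting a single search, is what gives an ensemble approach of this kind its stability when variables far outnumber observations.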

Keywords: Ensemble method; Feature selection; Gene identification; High dimensionality spaces; Microarray data.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Case-Control Studies
  • Oligonucleotide Array Sequence Analysis / methods