Biomarker signature identification in "omics" data with multi-class outcome

Vincenzo Lagani; George Kortas; Ioannis Tsamardinos

doi:10.5936/csbj.201303004

Comput Struct Biotechnol J. 2013 Jun 8:6:e201303004. doi: 10.5936/csbj.201303004. eCollection 2013.

Authors

Vincenzo Lagani¹, George Kortas², Ioannis Tsamardinos³

Affiliations

¹ Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), N. Plastira 100, Vassilika Vouton, GR-700 13 Heraklion, Crete, Greece.
² Department of Computer Science, University of Crete, P.O.Box 2208, GR-710 03 Heraklion, Crete, Greece.
³ Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), N. Plastira 100, Vassilika Vouton, GR-700 13 Heraklion, Crete, Greece; Department of Computer Science, University of Crete, P.O.Box 2208, GR-710 03 Heraklion, Crete, Greece.

Abstract

Biomarker signature identification in "omics" data is a complex challenge that requires specialized feature selection algorithms. The objective of these algorithms is to select the smallest set(s) of molecular quantities that predict a given outcome (target) with maximal predictive performance. This task is even more challenging when the outcome comprises multiple classes; for example, one may be interested in identifying the genes whose expression allows discrimination among different types of cancer (nominal outcome) or among different stages of the same cancer, e.g. Stages 1, 2, 3 and 4 of lung adenocarcinoma (ordinal outcome). In this work, we consider a particular type of successful feature selection method, namely constraint-based, local causal discovery algorithms. These algorithms rely on performing a series of conditional independence tests. We extend these algorithms to the analysis of problems with continuous predictors and multi-class outcomes by developing and equipping them with an appropriate conditional independence test procedure for both nominal and ordinal multi-class targets. The test is based on multinomial logistic regression and employs the log-likelihood ratio test for model selection. We present a comparative experimental evaluation on seven real-world, high-dimensional gene-expression datasets. Within the scope of our analysis, the results indicate that the new conditional independence test allows the identification of smaller and better-performing signatures for multi-class outcome datasets, compared with the current alternatives for performing the independence tests.
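
To make the idea of a regression-based conditional independence test concrete, the following is a minimal sketch and not the authors' implementation. It covers only the nominal case, and it assumes NumPy and SciPy are available. The function names `max_loglik` and `lr_ci_test` are hypothetical. A null model (outcome regressed on the conditioning set Z) is compared with a full model (Z plus the candidate predictor x) through the log-likelihood ratio statistic, which is referred to a chi-squared distribution with one degree of freedom per added coefficient.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import chi2


def max_loglik(design, y, n_classes):
    """Maximised log-likelihood of a multinomial logit model (class 0 is the reference)."""
    n, p = design.shape
    onehot = np.eye(n_classes)[y]

    def neg_loglik_and_grad(w):
        W = w.reshape(p, n_classes - 1)
        # The reference class has its logit fixed at 0.
        logits = np.column_stack([np.zeros(n), design @ W])
        log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
        nll = -np.sum(onehot * log_probs)
        grad = design.T @ (np.exp(log_probs) - onehot)[:, 1:]
        return nll, grad.ravel()

    start = np.zeros(p * (n_classes - 1))
    result = minimize(neg_loglik_and_grad, start, jac=True, method="BFGS")
    return -result.fun


def lr_ci_test(x, y, Z=None):
    """p-value for H0: x is independent of y given Z (nominal y coded 0..K-1)."""
    y = np.asarray(y).astype(int)
    n = len(y)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    n_classes = int(y.max()) + 1

    # Null model: intercept plus the conditioning set (if any).
    null_design = np.ones((n, 1))
    if Z is not None:
        Z = np.asarray(Z, dtype=float).reshape(n, -1)
        null_design = np.column_stack([null_design, Z])
    # Full model: null model plus the candidate predictor.
    full_design = np.column_stack([null_design, x])

    stat = 2.0 * (max_loglik(full_design, y, n_classes)
                  - max_loglik(null_design, y, n_classes))
    dof = x.shape[1] * (n_classes - 1)
    # Guard against a tiny negative statistic caused by optimiser tolerance.
    return chi2.sf(max(stat, 0.0), dof)
```

A small p-value from `lr_ci_test(x, y, Z)` is evidence against conditional independence, so a constraint-based algorithm would keep x as a candidate; a large p-value would let it discard x given Z. An ordinal outcome would call for an ordinal regression model in place of the softmax model used here.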

Keywords: Biomarker Signature Identification; Constraint-based Methods; Graphical Models; High Dimensional Data; Multiple Outcomes Studies; “Omics” Data.