Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data

BioTech (Basel). 2023 Jul 27;12(3):52. doi: 10.3390/biotech12030052.

Abstract

Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression for such scenarios, comprising variable screening, model selection, order selection for response categories, and variable selection. We apply the procedure to high-dimensional gene expression data covering 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three final models: one with 74 genes, which achieves extremely low cross-entropy loss and a zero predictive error rate under five-fold cross-validation, and two others, with 31 and 4 genes respectively, which are recommended as prognostic multi-gene signatures.

Keywords: cross-validation; hurdle model; model selection; multinomial logistic model; order selection; variable selection; zero-inflated model.
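
The abstract reports two evaluation metrics under five-fold cross-validation: cross-entropy loss and predictive error rate. The following is a minimal sketch of how such metrics could be computed from predicted class probabilities. It is not the authors' code: the function names (`k_fold_indices`, `cross_entropy_loss`, `error_rate`) are hypothetical, the probabilities are made-up toy values, and fitting the multinomial logistic model itself is omitted.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [idx[i::k] for i in range(k)]


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)


def error_rate(probs, labels):
    """Fraction of samples whose most probable class is not the true class."""
    wrong = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != y:
            wrong += 1
    return wrong / len(labels)


# Toy predicted probabilities for 4 samples and 3 classes (illustrative only).
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
    [0.50, 0.30, 0.20],
]
labels = [0, 1, 2, 1]

# Five folds over 801 samples, matching the sample size stated in the abstract.
folds = k_fold_indices(801, k=5)

print("fold sizes:", [len(f) for f in folds])
print("cross-entropy:", cross_entropy_loss(probs, labels))
print("error rate:", error_rate(probs, labels))
```

In a full cross-validation run, each fold would serve once as the held-out set: the model is fitted on the other four folds, and the two metrics are computed on the held-out predictions and then aggregated across folds.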