TSPLASSO: A Two-stage Prior LASSO Algorithm for Gene Selection using Omics Data

IEEE J Biomed Health Inform. 2023 Oct 23:PP. doi: 10.1109/JBHI.2023.3326485. Online ahead of print.

Abstract

Feature selection has been extensively applied to identify cancer genes using omics data. Although many studies have searched for cancer genes, the rich knowledge available on various cancers is seldom used as prior information in feature selection. This paper proposes a two-stage prior LASSO (TSPLASSO) method, an early attempt at designing feature selection algorithms that use prior information. The first stage performs gene selection via linear regression with LASSO, retaining candidate genes that are correlated with known cancer genes for subsequent analysis. The second stage fits a logistic regression model with LASSO to perform the final cancer gene selection and sample classification. A key advantage of TSPLASSO is that it uses prior cancer genes and binary sample types as the response variables in stages one and two, respectively. In addition, TSPLASSO performs sample classification and variable selection simultaneously. In numerical simulations on six real-world datasets, TSPLASSO improves the accuracy of variable selection by 5%-400% over six state-of-the-art algorithms in the three bulk sequencing datasets and the scRNA-seq dataset, and its performance is robust to data noise and to variations in the prior cancer genes. TSPLASSO provides an efficient, stable and practical algorithm for biomedical and health informatics research using omics data.
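
The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the solver, the penalty levels, or how results for several prior genes are combined. The sketch assumes that stage one regresses each prior gene's expression on all non-prior genes and keeps the union of genes with nonzero coefficients, and it substitutes a plain proximal-gradient (ISTA) solver for whatever LASSO solver the paper uses. The names `tsplasso_sketch`, `lasso_ista`, `lam1` and `lam2`, and the synthetic data, are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / sd

def lasso_ista(X, y, lam, logistic=False, n_iter=2000):
    """Minimise mean loss + lam * ||w||_1 by proximal gradient descent (ISTA).

    The loss is half the squared error (linear) or the log-loss (logistic);
    the intercept b is not penalised.
    """
    n, p = X.shape
    # Step size is 1/L, where L bounds the Lipschitz constant of the gradient.
    L = np.linalg.norm(np.c_[X, np.ones(n)], 2) ** 2 / n
    if logistic:
        L /= 4.0
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        z = X @ w + b
        pred = 1.0 / (1.0 + np.exp(-z)) if logistic else z
        resid = pred - y
        w = w - (X.T @ resid) / (n * L)                         # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft-threshold
        b -= resid.mean() / L
    return w, b

def tsplasso_sketch(X, y, prior_idx, lam1=0.2, lam2=0.05):
    """Hypothetical two-stage prior LASSO; lam1 and lam2 are illustrative."""
    Xs = standardize(np.asarray(X, dtype=float))
    prior = set(prior_idx)
    others = np.array([j for j in range(Xs.shape[1]) if j not in prior])

    # Stage one (assumed form): each prior gene is the response of a linear
    # LASSO on the non-prior genes; genes with nonzero weight become candidates.
    candidates = set()
    for g in prior_idx:
        w, _ = lasso_ista(Xs[:, others], Xs[:, g], lam1)
        candidates.update(int(j) for j in others[w != 0])
    candidates = sorted(candidates)

    # Stage two: logistic LASSO of the binary sample type on the candidates.
    w, _ = lasso_ista(Xs[:, candidates], np.asarray(y, dtype=float),
                      lam2, logistic=True)
    selected = [j for j, wj in zip(candidates, w) if wj != 0]
    return candidates, selected

# Synthetic illustration: gene 0 is the "known" gene, genes 1 and 2 share its
# latent signal f, and the sample label depends on f. All values are made up.
rng = np.random.default_rng(0)
n, p = 200, 30
f = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :3] = f[:, None] + 0.5 * rng.normal(size=(n, 3))
y = (f > 0).astype(int)

candidates, selected = tsplasso_sketch(X, y, prior_idx=[0])
print("stage-one candidates:", candidates)
print("stage-two selection: ", selected)
```

In this toy setting, the genes that share the prior gene's signal pass stage one, and stage two then chooses among them using the sample labels. In practice the two penalty levels would be tuned, for example by cross-validation, rather than fixed as they are here.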