PLS-based gene subset augmentation and tumor-specific gene identification

Comput Biol Med. 2024 May:174:108434. doi: 10.1016/j.compbiomed.2024.108434. Epub 2024 Apr 16.

Abstract

Identifying genes that are specifically expressed in disease states is pivotal to the study of tumor pathogenesis, but high-dimensional datasets with limited samples make this difficult. Conventional gene (feature) selection methods often fail to capture the complexity of gene-phenotype and gene-gene interactions, so a more robust analysis method is needed. To address these challenges, this paper proposes a gene subset augmentation strategy. The approach introduces diverse perturbation mechanisms to generate distinct gene subsets. A partial least squares (PLS)-based multiple gene measurement algorithm considers both gene-phenotype and gene-gene correlations and identifies differentially expressed genes, including those with weak signals. Gene networks constructed from the augmented subsets reveal regulatory patterns and support a comprehensive association analysis of the genes. The algorithm identifies small gene subsets with strong discriminative power, surpassing traditional methods that yield a single gene subset. Unlike those methods, it reveals a spectrum of different gene subsets together with their weakly differentially expressed genes. This perspective helps unravel the molecular characteristics and specific expression patterns of tumor genes. Beyond tumor-specific gene identification, the approach holds promise for other fields characterized by high-dimensional datasets and limited samples. The Python implementation is available at http://github.com/wenjieyou/PLSGSA.
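The general idea of perturbing the data, scoring genes with PLS, and linking genes that are selected together can be illustrated with a minimal sketch. This is not the PLSGSA implementation: the abstract does not specify the perturbation mechanisms or the gene measure, so bootstrap resampling of samples, ranking by the absolute weight of the first PLS component, and co-occurrence counting are all assumptions made for illustration. The function names (`pls1_weights`, `augmented_subsets`, `cooccurrence_edges`) and the toy data are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def pls1_weights(X, y):
    """First PLS component weights for one response: w is proportional to Xc^T yc."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w] if norm > 0 else w


def augmented_subsets(X, y, k=3, n_perturb=20, seed=0):
    """Return one top-k gene subset per bootstrap perturbation (assumed mechanism)."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    while len(subsets) < n_perturb:
        idx = [rng.randrange(n) for _ in range(n)]  # resample samples with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # a single-class draw carries no phenotype signal
            continue
        w = pls1_weights([X[i] for i in idx], yb)
        top = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]
        subsets.append(frozenset(top))
    return subsets


def cooccurrence_edges(subsets):
    """Count how often each gene pair is selected together across subsets."""
    edges = Counter()
    for s in subsets:
        for a, b in combinations(sorted(s), 2):
            edges[(a, b)] += 1
    return edges


# Toy data (hypothetical): 12 samples, 6 genes. Genes 0 and 1 are strongly
# differentially expressed, gene 2 only weakly, genes 3-5 are pure noise.
data_rng = random.Random(42)
y = [0] * 6 + [1] * 6
X = []
for label in y:
    row = [
        10 * label + data_rng.uniform(-0.5, 0.5),
        -10 * label + data_rng.uniform(-0.5, 0.5),
        0.6 * label + data_rng.uniform(-0.5, 0.5),
    ]
    row += [data_rng.uniform(-0.5, 0.5) for _ in range(3)]
    X.append(row)

subsets = augmented_subsets(X, y, k=3, n_perturb=20, seed=0)
edges = cooccurrence_edges(subsets)
print(Counter(subsets).most_common())
print(edges.most_common())
```

In this toy setting the two strong genes appear in every subset, while the third slot can change from one perturbation to the next. That variation is where a weakly differentially expressed gene can surface even though a single run might miss it. The sketch only centers the genes, so it assumes they are on comparable scales, and it ranks them on one PLS component only.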

Keywords: Gene (feature) subset augmentation; Gene association analysis; High-dimensional small sample; Tumor-specific genes; Weakly differentially expressed genes.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases, Genetic
  • Gene Expression Profiling
  • Gene Expression Regulation, Neoplastic
  • Gene Regulatory Networks
  • Humans
  • Least-Squares Analysis
  • Neoplasms* / genetics