A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma

PLoS One. 2022 Sep 6;17(9):e0269126. doi: 10.1371/journal.pone.0269126. eCollection 2022.

Abstract

Lung cancer (LC) is among the most frequently diagnosed cancers worldwide, and lung adenocarcinoma (LUAD) is its most common type. Although RNA-seq and microarray experiments yield vast amounts of gene expression data, most genes are not informative for clinical diagnosis. Feature selection (FS) techniques address the high dimensionality and sparsity of such large-scale data. We propose a framework that applies an ensemble of FS techniques to identify genes highly correlated with LUAD. Using LUAD RNA-seq data from The Cancer Genome Atlas (TCGA), we applied mutual information (MI) and recursive feature elimination (RFE) as FS techniques together with a support vector machine (SVM) classification model, and we used Random Forest (RF) as an embedded FS technique. The results were integrated, and candidate biomarker genes across all techniques were identified. The framework identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model trained on the expression profiles of the identified biomarkers achieved a performance of 97.99%. In addition, differential gene expression analysis showed that all 12 genes were significantly differentially expressed between normal and LUAD tissues, and previous reports indicate that they are strongly correlated with LUAD. We propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.
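The integration step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that expression values are binarized at the per-gene median before MI is computed, and that "integration" means keeping the genes found in the top-k list of every technique; the abstract specifies neither. The gene names, the expression values, and the two rankings standing in for RFE with SVM and for RF importance are hypothetical placeholders.

```python
import math
import statistics
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    p_feature = Counter(feature)
    p_label = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (p_feature[x] * p_label[y]))
    return mi


def binarize_at_median(values):
    """Assumed discretization: 1 if above the gene's median, else 0."""
    median = statistics.median(values)
    return [int(v > median) for v in values]


def rank_by_mi(expression, labels):
    """Return gene names sorted by MI with the class labels, best first."""
    scores = {
        gene: mutual_information(binarize_at_median(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def consensus(rankings, top_k):
    """Assumed integration rule: genes in the top_k of every ranking."""
    common = set(rankings[0][:top_k])
    for ranking in rankings[1:]:
        common &= set(ranking[:top_k])
    return sorted(common)


# Hypothetical toy data: 4 normal samples (0) followed by 4 tumor samples (1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.1],  # higher in tumor
    "GENE_B": [2.0, 5.0, 2.1, 5.1, 2.2, 5.2, 1.9, 4.9],  # unrelated to class
    "GENE_C": [3.0, 3.1, 2.9, 3.2, 0.5, 0.4, 0.6, 0.3],  # lower in tumor
}

mi_rank = rank_by_mi(expression, labels)
# Hypothetical placeholders for rankings produced by the other two techniques.
rfe_svm_rank = ["GENE_C", "GENE_A", "GENE_B"]
rf_rank = ["GENE_A", "GENE_B", "GENE_C"]

candidates = consensus([mi_rank, rfe_svm_rank, rf_rank], top_k=2)
print(candidates)
```

Requiring agreement across all rankings is what shrinks the candidate list: a gene favored by only one technique is dropped, which matches the abstract's claim that combining several FS methods reduces the number of identified biomarkers.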

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adenocarcinoma of Lung* / diagnosis
  • Adenocarcinoma of Lung* / genetics
  • Adenocarcinoma* / diagnosis
  • Adenocarcinoma* / genetics
  • Biomarkers
  • Biomarkers, Tumor / genetics
  • Gene Expression Regulation, Neoplastic
  • Humans
  • Lung Neoplasms* / diagnosis
  • Lung Neoplasms* / genetics
  • Lung Neoplasms* / pathology
  • Support Vector Machine

Substances

  • Biomarkers
  • Biomarkers, Tumor

Grants and funding

The study is funded by the International Centre for Genetic Engineering and Biotechnology (ICGEB) 'CRP/EGY18-05_EC'. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.