Gene expression microarray public dataset reanalysis in chronic obstructive pulmonary disease

PLoS One. 2019 Nov 15;14(11):e0224750. doi: 10.1371/journal.pone.0224750. eCollection 2019.

Abstract

Chronic obstructive pulmonary disease (COPD) was classified by the Centers for Disease Control and Prevention in 2014 as the third leading cause of death in the United States (US). The main cause of COPD is exposure to tobacco smoke and air pollutants. Problems associated with COPD include under-diagnosis of the disease and an increase in the number of smokers worldwide. The goal of our study is to identify disease variability in the gene expression profiles of COPD subjects compared to controls by reanalyzing pre-existing, publicly available microarray expression datasets. Our inclusion criteria required that microarray datasets report the smoking status, age and sex of the blood donors. The selected datasets used Affymetrix and Agilent microarray platforms (7 datasets, 1,262 samples). We re-analyzed the curated raw microarray expression data using R packages, and used Box-Cox power transformations to normalize the datasets. To identify significant differentially expressed genes, we used generalized least squares models with disease state, age, sex, smoking status and study as effects, including binary interactions, followed by likelihood ratio tests (LRT). We found 3,315 statistically significant (Storey-adjusted q-value <0.05) differentially expressed genes with respect to disease state (COPD or control). We further filtered these genes for biological effect (using LRT q-value <0.05 and the 10% two-tailed quantiles of the model estimates of mean differences between COPD and control) to identify 679 genes. Through analysis of disease, sex, age, and also smoking status and disease interactions, we identified differentially expressed genes involved in a variety of immune responses and cell processes in COPD. We also trained a logistic regression model using the common array genes as features, which enabled prediction of disease status with 81.7% accuracy. Our results offer potential for improving the diagnosis of COPD through blood and highlight novel gene expression disease signatures.
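
As an illustration only, the Box-Cox transformation and the two-step gene filter described in the abstract can be sketched in Python. This is not the authors' code (the study used R packages), and several details are assumptions: the function names box_cox, quantile and filter_genes are hypothetical, the Box-Cox lambda is passed in rather than estimated from the data, the quantile cutoffs are computed over all genes supplied, "10% two-tailed" is read as the lowest 10% and highest 10% of mean differences, and the toy values are invented purely to exercise the functions.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transform of strictly positive values.

    lam is supplied by the caller (an assumption of this sketch);
    in practice it is usually estimated from the data.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def quantile(sorted_values, p):
    """Quantile of an ascending list by linear interpolation."""
    pos = p * (len(sorted_values) - 1)
    low = math.floor(pos)
    high = min(low + 1, len(sorted_values) - 1)
    frac = pos - low
    return sorted_values[low] + frac * (sorted_values[high] - sorted_values[low])


def filter_genes(results, q_cutoff=0.05, tail=0.10):
    """Keep genes that pass both a significance and an effect-size filter.

    results maps gene name -> (q_value, mean_difference).
    A gene is kept if q_value < q_cutoff and its mean difference lies in
    the lower or upper `tail` fraction of all mean differences
    (one possible reading of "10% two-tailed quantiles").
    """
    diffs = sorted(diff for _, diff in results.values())
    lower = quantile(diffs, tail)
    upper = quantile(diffs, 1.0 - tail)
    return sorted(
        gene
        for gene, (q_value, diff) in results.items()
        if q_value < q_cutoff and (diff <= lower or diff >= upper)
    )


if __name__ == "__main__":
    # Toy, invented values: (q-value, COPD minus control mean difference).
    toy = {
        "geneA": (0.01, -3.0),   # significant, large effect
        "geneB": (0.30, -0.2),
        "geneC": (0.001, -0.1),  # significant, small effect
        "geneD": (0.04, 0.1),
        "geneE": (0.50, 0.2),
        "geneF": (0.20, 3.0),    # large effect, not significant
    }
    print(filter_genes(toy))
    print(box_cox([1.0, 4.0, 9.0], 0.5))
```

Requiring both conditions mirrors the abstract's distinction between statistical significance (q-value) and biological effect (magnitude of the mean difference); a gene that is significant but has a small effect, or has a large effect but is not significant, is dropped.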

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Age Factors
  • Air Pollutants / adverse effects
  • Biomarkers / metabolism
  • Data Mining*
  • Datasets as Topic
  • Down-Regulation
  • Female
  • Gene Expression Profiling / statistics & numerical data
  • Humans
  • Logistic Models
  • Machine Learning
  • Male
  • Models, Genetic
  • Oligonucleotide Array Sequence Analysis / statistics & numerical data
  • Pulmonary Disease, Chronic Obstructive / diagnosis
  • Pulmonary Disease, Chronic Obstructive / epidemiology*
  • Pulmonary Disease, Chronic Obstructive / etiology
  • Pulmonary Disease, Chronic Obstructive / genetics
  • Risk Assessment / methods
  • Risk Factors
  • Sex Factors
  • Smoking / adverse effects
  • Smoking / epidemiology
  • Transcriptome / genetics*
  • United States / epidemiology
  • Up-Regulation

Substances

  • Air Pollutants
  • Biomarkers

Associated data

  • figshare/10.6084/m9.figshare.8233175

Grants and funding

LRKR is funded through the University Enrichment Fellowship at Michigan State University. GIM is funded by the Jean P. Schultz Endowed Biomedical Research Fund.