A merged lung cancer transcriptome dataset for clinical predictive modeling

Sci Data. 2018 Jul 24:5:180136. doi: 10.1038/sdata.2018.136.

Abstract

The Gene Expression Omnibus (GEO) database is an excellent public source of whole-transcriptome profiles of multiple cancers. The main challenge is that such large-scale genomic data are hard to access for people without a background in bioinformatics or computer science, which complicates data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data have been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
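The last two preprocessing steps named above (batch effect correction and low-variance filtering) can be illustrated with a minimal sketch. This is not the authors' pipeline, which relies on R packages not named in this abstract. Per-study mean-centering is used here only as a crude stand-in for a real batch correction method, and the function names, gene IDs, study labels, expression values and variance threshold are all hypothetical.

```python
from statistics import mean, pvariance


def center_by_batch(expr, batches):
    """Subtract each gene's per-batch mean.

    A crude stand-in for batch effect correction (an assumption for
    illustration, not the method used in the paper).
    expr: list of rows, one row of sample values per gene.
    batches: batch (study) label for each sample column.
    """
    corrected = []
    for gene_values in expr:
        row = list(gene_values)
        for batch in set(batches):
            idx = [i for i, b in enumerate(batches) if b == batch]
            batch_mean = mean(gene_values[i] for i in idx)
            for i in idx:
                row[i] = gene_values[i] - batch_mean
        corrected.append(row)
    return corrected


def filter_low_variance(expr, gene_ids, min_variance):
    """Keep only genes whose variance across samples is at least min_variance."""
    kept = [(g, row) for g, row in zip(gene_ids, expr)
            if pvariance(row) >= min_variance]
    return [g for g, _ in kept], [row for _, row in kept]


# Toy data: two genes, four samples drawn from two hypothetical studies.
gene_ids = ["GENE_A", "GENE_B"]
batches = ["study1", "study1", "study2", "study2"]
expr = [
    [1.0, 3.0, 11.0, 13.0],  # varies within each study, plus a study offset
    [5.0, 5.0, 9.0, 9.0],    # differs only between studies
]

corrected = center_by_batch(expr, batches)
kept_ids, kept_expr = filter_low_variance(corrected, gene_ids, min_variance=0.5)
print(kept_ids)
```

In this toy case, GENE_B varies only between studies, so centering flattens it and the variance filter drops it, while GENE_A is kept. Simple centering like this would also remove genuine biological differences whenever they are confounded with study, which is one reason dedicated batch correction methods are normally preferred.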

Publication types

  • Dataset
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Carcinoma, Non-Small-Cell Lung / genetics*
  • Computational Biology / methods
  • Data Analysis
  • Databases, Factual
  • Gene Expression Profiling / methods
  • Humans
  • Lung Neoplasms / genetics*
  • Transcriptome*

Associated data

  • figshare/10.6084/m9.figshare.5350321