Human blood gene signature as a marker for smoking exposure: computational approaches of the top ranked teams in the sbv IMPROVER Systems Toxicology challenge

Comput Toxicol. 2018 Feb:5:31-37. doi: 10.1016/j.comtox.2017.07.003. Epub 2017 Jul 18.

Abstract

Crowdsourcing has emerged as a framework for addressing methodological challenges in omics data analysis and for assessing the extent to which omics data are predictive of phenotypes of interest. The sbv IMPROVER Systems Toxicology Challenge was designed to leverage crowdsourcing to determine whether human blood gene expression levels are informative of current and past smoking. Participating teams were invited to use a training gene expression dataset to derive parsimonious models (up to 40 genes) that accurately classify subjects into exposure groups: current smokers, former smokers who had quit for at least one year, and never-smokers. Teams were ranked based on two classification performance metrics evaluated on a blinded test dataset. The analytical approaches of the first- and third-ranked teams, which are presented in detail in this article, involved feature selection by moderated t-test or LASSO regression, and linear discriminant analysis (LDA) and logistic regression classifiers, respectively. Although the 12-gene signature of the top team classified current smokers with 100% sensitivity at 93% specificity, discriminating former smokers from never-smokers was much more challenging (65% sensitivity at 57% specificity). Gene ontology molecular functions and KEGG pathways associated with current smoking included G protein-coupled receptor activity, signaling receptor activity, calcium ion binding, and the Neuroactive ligand-receptor interaction pathway. Selection of marker genes by either moderated t-test or multivariate LASSO regression, followed by LDA or logistic regression, is a robust approach to classification with omics data, confirming in part the findings of previous sbv IMPROVER challenges. Although current smoking is accurately identified from blood mRNA levels, smoking cessation for more than one year is accompanied by a "normalization" of the expression of certain mRNAs, which makes it difficult to distinguish former smokers from never-smokers.
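
The general shape of such a pipeline (rank genes, keep a capped signature, fit a linear classifier, report sensitivity and specificity) can be illustrated with a minimal sketch. This is not the teams' implementation, and it makes several simplifying assumptions: an ordinary Welch t-statistic stands in for the moderated t-test, a diagonal-covariance LDA with equal class priors stands in for full LDA, and the gene names (`GENE_A`, `GENE_B`, `GENE_C`), expression values, and labels are invented toy data rather than challenge data.

```python
from math import sqrt
from statistics import mean, variance


def split(values, labels):
    """Split one gene's expression values by class (1 = current smoker, 0 = other)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def welch_t(pos, neg):
    """Ordinary Welch t-statistic (assumption: a stand-in for a moderated t-test)."""
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def select_genes(expr, labels, max_genes=40):
    """Rank genes by |t| and keep at most max_genes (the challenge cap was 40)."""
    scores = {g: abs(welch_t(*split(v, labels))) for g, v in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:max_genes]


def fit_diagonal_lda(expr, labels, genes):
    """Per-gene class means and pooled variance (diagonal LDA, equal priors)."""
    model = {}
    for g in genes:
        pos, neg = split(expr[g], labels)
        pooled = ((len(pos) - 1) * variance(pos) + (len(neg) - 1) * variance(neg)) / (
            len(pos) + len(neg) - 2
        )
        model[g] = (mean(pos), mean(neg), pooled)
    return model


def predict(model, sample):
    """Return 1 if the linear discriminant score is positive, else 0."""
    score = sum(
        (sample[g] - (m1 + m0) / 2) * (m1 - m0) / s2
        for g, (m1, m0, s2) in model.items()
    )
    return 1 if score > 0 else 0


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical training data: three smokers (1) followed by three non-smokers (0).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "GENE_A": [8.1, 7.9, 8.3, 5.0, 5.2, 4.9],  # strongly higher in smokers
    "GENE_B": [6.0, 6.2, 5.9, 6.1, 5.8, 6.3],  # uninformative
    "GENE_C": [3.0, 3.4, 3.1, 4.0, 4.2, 3.9],  # moderately lower in smokers
}

genes = select_genes(expr, labels, max_genes=2)
model = fit_diagonal_lda(expr, labels, genes)

# Hypothetical held-out samples with their true labels.
test_samples = [
    {"GENE_A": 8.0, "GENE_B": 6.0, "GENE_C": 3.2},
    {"GENE_A": 5.1, "GENE_B": 6.1, "GENE_C": 4.1},
    {"GENE_A": 6.8, "GENE_B": 6.0, "GENE_C": 3.5},
]
y_true = [1, 0, 0]
predictions = [predict(model, s) for s in test_samples]
sens, spec = sensitivity_specificity(y_true, predictions)
print(genes, predictions, sens, spec)
```

In this toy setting the third held-out sample lies between the two class means and is called a smoker, which lowers specificity while leaving sensitivity unchanged. The same trade-off, on real data, underlies the reported figures for the harder former-smoker versus never-smoker comparison.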

Keywords: Systems toxicology; computational challenge; gene signature; predictive modeling; smoking biomarker.