Species translatable blood gene signature as a marker of exposure to smoking: computational approaches of the top ranked teams in the sbv IMPROVER Systems Toxicology challenge

Comput Toxicol. 2018 Feb:5:25-30. doi: 10.1016/j.comtox.2017.04.001. Epub 2017 Apr 28.

Abstract

Crowdsourcing has been used to address computational challenges in systems biology and assess translation of findings across species. Sub-challenge 2 of the sbv IMPROVER Systems Toxicology Challenge was designed to determine whether a common set of genes can be used to identify exposure to cigarette smoke in both human and mouse. Participating teams used a training set of human and mouse blood gene expression data to derive parsimonious models (up to 40 genes) that classify subjects into exposure groups: smokers, former smokers, and never-smokers. Teams were ranked based on two classification performance metrics evaluated on a blinded test dataset. Prediction of current exposure to cigarette smoke in human and mouse by a common prediction model was achieved by the top ranked team (Team 219) with 89% balanced accuracy (BAC), while past exposure was predicted with only 57% BAC. The prediction model of the top ranked team was a random forest classifier trained on sets of genes that appeared best for each species separately, with no overlap between species. By contrast, Team 264, ranked second (tied with Team 250), selected genes that were simultaneously predictive in both species and achieved 80% and 59% BAC when predicting current and past exposure, respectively. These performance values were lower than the 96.5% and 61% BAC estimates for current and past exposure, respectively, obtained by Team 264 (top ranked in sub-challenge 1) when using only human data. Unlike past exposure, current exposure to cigarette smoke can be accurately assessed in both human and mouse with a common prediction model based on blood mRNAs. However, requiring a common gene signature to be predictive in both species resulted in a substantial decrease in balanced accuracy for prediction of current exposure to cigarette smoke (from 96.5% to 80%), suggesting that species-specific responses exist.

Keywords: Systems toxicology; computational challenge; predictive modeling; smoking biomarker; species-translatable gene signature.
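
The abstract reports performance as balanced accuracy (BAC) but does not define it. As a minimal sketch, assuming the standard definition (the mean of per-class recall, which for two classes is the mean of sensitivity and specificity), BAC could be computed as below. The function name `balanced_accuracy` and the example labels are hypothetical and are not taken from the challenge data or its scoring code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall.

    Assumes the standard definition of balanced accuracy; the challenge's
    exact scoring procedure is not given in the abstract.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced labels (not challenge data).
y_true = ["smoker"] * 8 + ["non-smoker"] * 2
y_pred = ["smoker"] * 8 + ["smoker", "non-smoker"]

bac = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {bac:.2f}, plain accuracy = {plain:.2f}")
```

In this illustrative case, recall is 8/8 for the majority class and 1/2 for the minority class, so BAC is 0.75 while plain accuracy is 0.90. Under the assumed definition, BAC weights each class equally, which is why it is a common choice when exposure groups differ in size.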