Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data

Cancer Inform. 2021 Nov 27;20:11769351211056298. doi: 10.1177/11769351211056298. eCollection 2021.

Abstract

Background: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions, approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world, data-driven approach for comparing the performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, L₀L₁ penalisation and L₀L₂ penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets to algorithmic selection by internal cross-validation.

Methods: Five large (n > 4000) genomic datasets were extracted from Gene Expression Omnibus. 'Gold-standard' regression models were trained on subspaces of these datasets (n > 4000, p = 500). Penalised regression models were trained on small samples from these subspaces (n ∈ {25, 75, 150}, p = 500) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty 'preselection' according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation.
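As a rough, hypothetical sketch of the benchmarking protocol (not the authors' code), the Python snippet below mimics a single iteration with scikit-learn: a gold-standard model is fitted on a large simulated subspace standing in for a GEO-derived expression matrix, penalised models are then fitted on a small subsample, and their coefficient vectors are retained for comparison. Only LASSO and elastic net are shown; best-subset and the L₀-type penalties evaluated in the paper require dedicated solvers outside scikit-learn. The simulated data, variable names and choice of an ordinary least-squares gold-standard fit are illustrative assumptions.

```python
# Hypothetical sketch of one benchmarking iteration (not the authors' code).
# Simulated data stands in for a GEO expression subspace (n > 4000, p = 500).
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
n_large, n_small, p = 4000, 75, 500      # n_small in {25, 75, 150} in the paper

# Simulated "population" subspace: sparse true signal plus noise.
X = rng.standard_normal((n_large, p))
beta_true = np.zeros(p)
beta_true[:10] = rng.normal(0, 2, 10)
y = X @ beta_true + rng.standard_normal(n_large)

# 'Gold-standard' model trained on the full subspace (OLS here is an assumption;
# the paper trains its gold-standard regression models on the full subspace).
gold = LinearRegression().fit(X, y)
beta_gold = gold.coef_

# Penalised models trained on a small sample drawn from the same subspace.
idx = rng.choice(n_large, size=n_small, replace=False)
X_small, y_small = X[idx], y[idx]

lasso = LassoCV(cv=5).fit(X_small, y_small)                              # L1
enet = ElasticNetCV(l1_ratio=[.1, .5, .9], cv=5).fit(X_small, y_small)   # L1L2

# Coefficient vectors retained for comparison against beta_gold.
estimates = {"lasso": lasso.coef_, "elastic_net": enet.coef_}
```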

Results: L₁L₂-penalisation achieved the highest cosine similarity between estimated coefficients and those of the gold-standard models. L₀L₂-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal-to-noise conditions. L₀L₂ also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation on each of the 3 examined metrics.
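For concreteness, the helpers below sketch how the three reported metrics could be computed, assuming a gold-standard coefficient vector beta_gold, an estimated vector beta_hat and a held-out test set as in the Methods sketch. The function names and the nonzero-coefficient rule for treating a variable as 'selected' are assumptions, not the paper's implementation.

```python
# Hypothetical metric helpers (names and thresholding are assumptions).
import numpy as np
from sklearn.metrics import f1_score, r2_score

def coef_cosine(beta_hat, beta_gold):
    """Cosine similarity between estimated and gold-standard coefficients."""
    denom = np.linalg.norm(beta_hat) * np.linalg.norm(beta_gold)
    return float(beta_hat @ beta_gold / denom) if denom > 0 else 0.0

def selection_f1(beta_hat, beta_gold, tol=0.0):
    """Variable-selection F1: nonzero coefficients are treated as 'selected'."""
    return f1_score(np.abs(beta_gold) > tol, np.abs(beta_hat) > tol)

def test_r2(model, X_test, y_test):
    """Proportion of variance in held-out test responses explained by the model."""
    return r2_score(y_test, model.predict(X_test))
```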

Conclusions: This analysis presents a novel approach for comparing model selection algorithms in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of L₀L₂ penalisation for structural selection and L₁L₂ penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.

Keywords: Artificial intelligence; computational biology; gene regulatory networks; genomics; models, statistical.