Model performance and interpretability of semi-supervised generative adversarial networks to predict oncogenic variants with unlabeled data

BMC Bioinformatics. 2023 Feb 9;24(1):43. doi: 10.1186/s12859-023-05141-2.

Abstract

Background: It remains an important challenge to predict the functional consequences or clinical impacts of genetic variants in human diseases, such as cancer. An increasing number of genetic variants in cancer have been discovered and documented in public databases such as COSMIC, but the vast majority of them have no functional or clinical annotations. Some databases, such as CiVIC are available with manual annotation of functional mutations, but the size of the database is small due to the use of human annotation. Since the unlabeled data (millions of variants) typically outnumber labeled data (thousands of variants), computational tools that take advantage of unlabeled data may improve prediction accuracy.

Result: To leverage unlabeled data to predict functional importance of genetic variants, we introduced a method using semi-supervised generative adversarial networks (SGAN), incorporating features from both labeled and unlabeled data. Our SGAN model incorporated features from clinical guidelines and predictive scores from other computational tools. We also performed comparative analysis to study factors that influence prediction accuracy, such as using different algorithms, types of features, and training sample size, to provide more insights into variant prioritization. We found that SGAN can achieve competitive performances with small labeled training samples by incorporating unlabeled samples, which is a unique advantage compared to traditional machine learning methods. We also found that manually curated samples can achieve a more stable predictive performance than publicly available datasets.

Conclusions: By incorporating much larger samples of unlabeled data, the SGAN method can improve the ability to detect novel oncogenic variants, compared to other machine-learning algorithms that use only labeled datasets. SGAN can be potentially used to predict the pathogenicity of more complex variants such as structural variants or non-coding variants, with the availability of more training samples and informative features.

Keywords: Deep learning; Generative adversarial networks; Machine learning; Somatic variants; Variants annotation; Variants interpretation.

Publication types

  • Review

MeSH terms

  • Algorithms*
  • Databases, Factual
  • Humans
  • Machine Learning
  • Neoplasms* / genetics
  • Supervised Machine Learning