Explainable artificial intelligence to predict and identify prostate cancer tissue by gene expression

Comput Methods Programs Biomed. 2023 Oct:240:107719. doi: 10.1016/j.cmpb.2023.107719. Epub 2023 Jul 10.

Abstract

Background and objective: Prostate cancer (PCa) is one of the most prevalent cancers in men worldwide. Screening still relies on traditional strategies such as serum PSA levels, which are not necessarily cancer-specific, and digital rectal exams, which are often inconclusive. Some studies have focused on identifying biomarkers of the disease, but none has been reported for diagnosis in routine clinical practice, and few have provided tools to assist the pathologist in decision-making when analyzing prostate tissue. We therefore propose a classifier to predict the occurrence of PCa that gives physicians accurate predictions and understandable explanations.

Methods: A set of 47 genes was selected based on differential expression between PCa and normal tissue, the Gene Ontology (GO), and the literature, and used as input predictors for different machine learning methods based on eXplainable Artificial Intelligence. These methods were trained with different class-balancing strategies to build accurate classifiers from gene expression data for 550 samples from The Cancer Genome Atlas. The resulting model was validated in four external cohorts of different ancestries, totaling 463 samples. In addition, SHapley Additive exPlanations were provided to help clinicians understand the underlying reasons for each decision.
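One of the class-balancing strategies named in this abstract is majority class downsampling. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the function name `downsample_majority`, the seed, and the toy data (8 tumor and 3 normal samples with two made-up features each) are all assumptions for illustration. The classifier itself is not shown.

```python
import random
from collections import Counter


def downsample_majority(samples, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Size of the smallest (minority) class.
    n_min = min(len(group) for group in by_class.values())

    # Keep a random subset of size n_min from each class.
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):
            balanced.append((sample, label))

    # Shuffle so the classes are interleaved.
    rng.shuffle(balanced)
    xs = [sample for sample, _ in balanced]
    ys = [label for _, label in balanced]
    return xs, ys


# Toy data (made up): 8 tumor samples (label 1) and 3 normal samples (label 0).
toy_samples = [[float(i), float(i % 3)] for i in range(11)]
toy_labels = [1] * 8 + [0] * 3

balanced_x, balanced_y = downsample_majority(toy_samples, toy_labels, seed=42)
class_counts = Counter(balanced_y)
print(class_counts)
```

Downsampling discards data from the majority class, so in practice it is applied to the training split only, leaving the evaluation cohorts at their natural class ratio.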

Results: An in-depth analysis showed that the Random Forest algorithm combined with majority-class downsampling was the best-performing approach, with robust statistical significance. The method achieved an average sensitivity of 0.90 and an average specificity of 0.8, with an AUC of 0.84, across all databases. The relevance of the DLX1, MYL9 and FGFR genes for PCa screening was demonstrated, along with the important role of novel genes such as CAV2 and MYLK.
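For readers less familiar with the metrics reported here, the sketch below shows how sensitivity, specificity and AUC are conventionally computed from labels and classifier scores. It is an illustration only: the function names, the 0.5 decision threshold, and the eight toy labels and scores are assumptions, not data or code from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Label 1 is treated as the positive (tumor) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up): 4 tumor and 4 normal samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

# Assumed decision threshold of 0.5 turns scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sens, spec, auc)  # 0.75 0.75 0.9375
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why an abstract can report all three side by side.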

Conclusions: The model performed well in four independent external cohorts of different ancestries, and the explanations it provides are consistent with each other and with the literature, opening the way to its application in clinical practice. In the near future, these genes, in combination with the model, could be applied to liquid biopsy to improve PCa screening.

Keywords: Biomedical informatics; Clinical decision support; Explainable artificial intelligence; Machine learning; Molecular biology; Prostate cancer.

MeSH terms

  • Artificial Intelligence*
  • Gene Expression
  • Humans
  • Male
  • Prostatic Neoplasms* / genetics
  • Sensitivity and Specificity