XA4C: eXplainable representation learning via Autoencoders revealing Critical genes

PLoS Comput Biol. 2023 Oct 2;19(10):e1011476. doi: 10.1371/journal.pcbi.1011476. eCollection 2023 Oct.

Abstract

Machine learning models are frequently used in transcriptome analyses. In particular, Representation Learning (RL) models, e.g., autoencoders, are effective in learning critical representations from noisy data. However, learned representations, e.g., the "latent variables" in an autoencoder, are difficult to interpret, let alone to use for prioritizing essential genes for functional follow-up. In contrast, traditional analyses identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond what marginal effects (DiffEx) or correlations (DiffCoEx and Hub) can capture, indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof of concept, supported by eXplainable Artificial Intelligence (XAI), we implemented the eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene's contribution to latent variables and prioritizes Critical genes accordingly. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes, yet show higher enrichment in a comprehensive disease gene database (DisGeNET) and a cancer-specific database (COSMIC), evidencing their potential to disclose massive unknown biology. As an example, we discovered five Critical genes sitting at the center of the Lysine degradation (hsa00310) pathway that display distinct interaction patterns in tumor and normal tissues. In conclusion, XA4C facilitates explainable analysis using RL, and Critical genes discovered by explainable RL empower the study of complex interactions.
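The sketch below illustrates the general workflow the abstract describes: train an autoencoder on a samples-by-genes expression matrix, attribute each gene's contribution to the learned latent variables, and rank genes by aggregate contribution to nominate candidate Critical genes. This is not the XA4C implementation; the attribution step here uses a simple gradient-times-input score as a stand-in for the XAI (e.g., SHAP-style) contribution scores used in the paper, and the class names, layer sizes, and training settings are illustrative assumptions.

```python
# Illustrative sketch (assumptions throughout): autoencoder on an expression
# matrix, then per-gene contributions to latent variables, then a gene ranking.
import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    def __init__(self, n_genes: int, n_latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_genes),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent variables
        return self.decoder(z), z    # reconstruction and latent code


def train(model, X, epochs=50, lr=1e-3):
    """Full-batch reconstruction training; a placeholder for the real setup."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(X)
        loss_fn(recon, X).backward()
        opt.step()
    return model


def gene_contributions(model, X):
    """Gradient-x-input contribution of each gene to each latent variable,
    averaged over samples; returns a (n_latent, n_genes) tensor.
    (Stand-in for the SHAP-based attribution used by XA4C.)"""
    X = X.clone().requires_grad_(True)
    _, z = model(X)
    contrib = torch.zeros(z.shape[1], X.shape[1])
    for k in range(z.shape[1]):
        grad = torch.autograd.grad(z[:, k].sum(), X, retain_graph=True)[0]
        contrib[k] = (grad * X).abs().mean(dim=0).detach()
    return contrib


if __name__ == "__main__":
    # toy stand-in for a normalized expression matrix (samples x genes)
    X = torch.randn(200, 1000)
    model = train(Autoencoder(n_genes=1000), X)
    contrib = gene_contributions(model, X)
    # aggregate over latent variables; top-ranked genes ~ candidate Critical genes
    ranking = contrib.sum(dim=0).argsort(descending=True)
    print("Top 10 candidate Critical genes (column indices):", ranking[:10].tolist())
```

In this sketch the ranking simply sums absolute contributions across all latent variables; how contributions are scored and aggregated is a design choice, and the published method should be consulted for the exact XAI-based procedure.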

MeSH terms

  • Artificial Intelligence*
  • Databases, Factual
  • Gene Expression Profiling
  • Genes, Essential
  • Humans
  • Neoplasms*

Grants and funding

Q.Z. is supported by an NSERC Discovery Grant (RGPIN-2018-05147), a University of Calgary VPR Catalyst grant, and a New Frontiers in Research Fund (NFRFE-2018-00748). W.L. is partly supported by an NSERC CRD Grant (CRDPJ532227-18). Q.L. is partly supported by an Alberta Innovates LevMax-Health Program Bridge Funds (222300769). The computational infrastructure is funded by a Canada Foundation for Innovation JELF grant (36605) and an NSERC RTI grant (RTI-2021-00675). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.