Learning biologically-interpretable latent representations for gene expression data: Pathway Activity Score Learning Algorithm

Ioulia Karagiannaki; Krystallia Gourlia; Vincenzo Lagani; Yannis Pantazis; Ioannis Tsamardinos

doi:10.1007/s10994-022-06158-z

Mach Learn. 2023;112(11):4257-4287. doi: 10.1007/s10994-022-06158-z. Epub 2022 Apr 29.

Authors

Ioulia Karagiannaki¹, Krystallia Gourlia², Vincenzo Lagani³,⁴, Yannis Pantazis⁵, Ioannis Tsamardinos²,⁴,⁵

Affiliations

¹ Institute of Electronic Structure and Laser, Foundation for Research and Technology-Hellas (IESL-FORTH), Heraklion, Greece.
² Department of Computer Science, University of Crete, Heraklion, Greece.
³ Institute of Chemical Biology, Ilia State University, Tbilisi, 0162 Georgia.
⁴ JADBio, Gnosis Data Analysis PC, Heraklion, Crete, Greece.
⁵ Institute of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas, Heraklion, Greece.

Abstract

Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, lower-dimensional representations that retain the useful biological information do exist. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways (genesets in general) and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a fairly straightforward biological interpretation. PASL is shown to outperform the state-of-the-art method (PLIER) in predictive performance on two collections of breast cancer and leukemia gene expression datasets. PASL is also trained on a large corpus of 50,000 gene expression samples to construct a universal dictionary of features across different tissues and pathologies. The dictionary is validated on 35,643 held-out samples for reconstruction error. It is then applied to 165 held-out datasets spanning a diverse range of diseases. The AutoML tool JADBio is employed to show that the predictive information is retained in the PASL-created feature space after the transformation. The code is available at https://github.com/mensxmachina/PASL.
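To make the idea of pathway-aligned latent features concrete, the following is a minimal illustrative sketch and not the PASL algorithm itself, whose details are not given in this abstract. It assumes that each latent feature is a direction whose nonzero weights are restricted to the genes of one geneset, and it picks that direction as the leading singular vector of the corresponding submatrix (a plain PCA-style choice made here only for illustration). The function name `pathway_scores`, the toy data, and the geneset names are hypothetical.

```python
import numpy as np

def pathway_scores(X, genesets):
    """Illustrative pathway-constrained projection (NOT the published PASL method).

    X        : array of shape (n_samples, n_genes), expression values.
    genesets : dict mapping a geneset name to a list of gene column indices.
    Returns (scores, atoms), where atoms has shape (n_genesets, n_genes) and
    each row is nonzero only on the genes of its geneset, and scores has
    shape (n_samples, n_genesets).
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    atoms = np.zeros((len(genesets), X.shape[1]))
    for k, idx in enumerate(genesets.values()):
        idx = np.asarray(idx)
        # Leading right singular vector of the samples-by-geneset submatrix
        _, _, vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
        atoms[k, idx] = vt[0]                    # zero outside the geneset
    scores = Xc @ atoms.T                        # one score per sample and geneset
    return scores, atoms

# Toy data: 20 samples, 6 genes, two hypothetical non-overlapping genesets
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
genesets = {"pathway_A": [0, 1, 2], "pathway_B": [3, 4, 5]}
scores, atoms = pathway_scores(X, genesets)
```

Because each row of `atoms` is zero outside its geneset, every column of `scores` depends only on the genes of a single pathway, which is what makes such a feature readable as an activity score for that pathway.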

Keywords: Differential activation analysis; Dimensionality reduction; Disease classification; Pathway activity.