Dissection of gene expression datasets into clinically relevant interaction signatures via high-dimensional correlation maximization

Nat Commun. 2019 Nov 28;10(1):5417. doi: 10.1038/s41467-019-12713-5.

Abstract

Gene expression is controlled by many simultaneous interactions, frequently measured collectively in biology and medicine by high-throughput technologies. Inferring the generating effects and cooperating genes from these data is highly challenging. Here, we present an unsupervised hypothesis-generating learning concept termed signal dissection by correlation maximization (SDCM) that dissects large high-dimensional datasets into signatures. Each signature captures a particular signal pattern that was consistently observed for multiple genes and samples, likely caused by the same underlying interaction. A key difference from other methods is our flexible nonlinear signal superposition model, combined with a precise regression technique. Analyzing gene expression of diffuse large B-cell lymphoma, our method discovers previously unidentified signatures that reveal significant differences in patient survival. These signatures are more predictive than those from various methods used for comparison and validate robustly across technological platforms. This implies highly specific extraction of clinically relevant gene interactions.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology
  • Correlation of Data
  • Data Visualization*
  • Datasets as Topic
  • Gene Expression Profiling
  • Gene Expression*
  • Humans
  • Lymphoma, Large B-Cell, Diffuse / genetics*
  • Unsupervised Machine Learning*