Learning vector quantized representation for cancer subtypes identification

Comput Methods Programs Biomed. 2023 Jun:236:107543. doi: 10.1016/j.cmpb.2023.107543. Epub 2023 Apr 11.

Abstract

Background and objective: Defining and separating cancer subtypes is essential for personalized therapy and for patient prognosis. The definition of subtypes has been repeatedly recalibrated as our understanding has deepened. During this recalibration, researchers often rely on clustering of cancer data to provide an intuitive visual reference that can reveal the intrinsic characteristics of subtypes. The data being clustered are often omics data, such as transcriptomics, that correlate strongly with the underlying biological mechanisms. Although existing studies have shown promising results, they suffer from two issues inherent to omics data, sample scarcity and high dimensionality, and they impose unrealistic assumptions in order to extract useful features while avoiding overfitting to spurious correlations.

Methods: This paper proposes to leverage a recent, powerful generative model, the Vector-Quantized Variational AutoEncoder (VQ-VAE), to tackle these data issues and extract discrete representations. These representations are crucial to the quality of subsequent clustering because they retain only the information relevant to reconstructing the input.
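
To make the idea of a discrete representation concrete, the following is a minimal sketch of the vector quantization step at the core of a Vector-Quantized Variational AutoEncoder: each continuous latent vector is replaced by its nearest entry in a codebook. The abstract does not specify the architecture, so the function names, the toy codebook, and the toy latent vectors are all hypothetical. In a real model the latents would come from a trained encoder, and the codebook would be learned jointly with the encoder and decoder; training is omitted here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to its nearest codebook entry.

    Returns the index of the chosen entry for each latent (the discrete
    code) and the corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Nearest-neighbour lookup over the codebook.
        k = min(range(len(codebook)), key=lambda j: squared_distance(z, codebook[j]))
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Toy values (assumptions for illustration only, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)
print(quantized)
```

In this sketch, the list of codebook indices (or the quantized vectors) would be the features handed to a downstream clustering method in place of the raw high-dimensional expression profile.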

Results: Extensive experiments and medical analysis on multiple datasets comprising 10 distinct cancers demonstrate that the proposed clustering results significantly and robustly improve prognosis over prevalent subtyping systems.

Conclusion: Our proposal does not impose strict assumptions on the data distribution, and its latent features better represent the transcriptomic data of different cancer subtypes, yielding superior clustering performance with any mainstream clustering method.

Keywords: Cancer subtyping; Clustering; Deep generative models; Vector quantization.

MeSH terms

  • Cluster Analysis
  • Gene Expression Profiling
  • Humans
  • Neoplasms*
  • Transcriptome