Clustering distributions with the marginalized nested Dirichlet process

Biometrics. 2018 Jun;74(2):584-594. doi: 10.1111/biom.12778. Epub 2017 Sep 28.

Abstract

We introduce a marginal version of the nested Dirichlet process (NDP) to cluster distributions or histograms, and we apply the model to cluster genes by patterns of gene-gene interaction. The proposed approach is based on the nested partition implied by the original construction of the NDP. It allows simulation-exact inference, as opposed to inference under a truncated Dirichlet process approximation. More importantly, the construction highlights the nature of the NDP as a nested partition of experimental units. We use the proposed model to cluster genes related to DNA mismatch repair (DMR) by the distribution of their gene-gene interactions with other genes. Gene-gene interactions are recorded as coefficients in an auto-logistic model for the co-expression of two genes, adjusting for copy number variation, methylation, and protein activation. These coefficients are extracted from an online database called Zodiac, which is computed from The Cancer Genome Atlas (TCGA) data. We compare results with a variation of k-means clustering set up to cluster distributions, a truncated NDP, and a hierarchical clustering method. The proposed inference shows favorable performance, both under simulated conditions and on the real data sets.
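
The nested partition described in the abstract can be illustrated with a small prior simulation. The sketch below is not the authors' implementation and performs no posterior inference. It assumes the usual marginal reading of the nested Dirichlet process, in which distributions (here, groups of observations) are first partitioned by a Chinese restaurant process, and the pooled observations inside each resulting cluster of distributions are then partitioned by a second Chinese restaurant process. The function names `crp_partition` and `nested_partition` and the concentration parameters `alpha` and `beta` are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def crp_partition(n, concentration, rng):
    """Draw cluster labels for n items from a Chinese restaurant process."""
    labels = []
    sizes = []  # sizes[k] = number of items currently in cluster k
    for _ in range(n):
        # Join existing cluster k with weight sizes[k],
        # or open a new cluster with weight `concentration`.
        weights = sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


def nested_partition(group_sizes, alpha, beta, seed=0):
    """Sample a nested partition (illustrative assumption, prior draw only).

    group_sizes[j] is the number of observations in group j.
    Returns (group_labels, obs_labels), where obs_labels maps
    (group index, observation index) to (outer cluster, inner cluster).
    """
    rng = random.Random(seed)

    # Outer level: cluster the groups, i.e. the distributions.
    group_labels = crp_partition(len(group_sizes), alpha, rng)

    # Pool the observations of all groups that share an outer cluster.
    members = defaultdict(list)
    for j, (n_j, s) in enumerate(zip(group_sizes, group_labels)):
        members[s].extend((j, i) for i in range(n_j))

    # Inner level: cluster the pooled observations within each outer cluster.
    obs_labels = {}
    for s, obs in members.items():
        inner = crp_partition(len(obs), beta, rng)
        for key, c in zip(obs, inner):
            obs_labels[key] = (s, c)
    return group_labels, obs_labels


if __name__ == "__main__":
    groups, observations = nested_partition([5, 3, 4], alpha=1.0, beta=1.0, seed=1)
    print("outer labels per group:", groups)
    print("number of labelled observations:", len(observations))
```

In this sketch, two groups that receive the same outer label share a single inner partition, which is the sense in which the partition is nested.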

Keywords: Clustering distributions; Gene interactions; Nested Dirichlet process; TCGA; Zodiac.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Cluster Analysis*
  • DNA Mismatch Repair / genetics
  • Epistasis, Genetic
  • Gene Expression Profiling
  • Genes, Neoplasm
  • Humans
  • Statistical Distributions*