A nonparametric Bayesian approach for clustering bisulfate-based DNA methylation profiles

BMC Genomics. 2012;13 Suppl 6(Suppl 6):S20. doi: 10.1186/1471-2164-13-S6-S20. Epub 2012 Oct 26.

Abstract

Background: DNA methylation occurs in the context of a CpG dinucleotide. It is an important epigenetic modification that can be inherited through cell division. The two major types of methylation are hypomethylation and hypermethylation. Unique methylation patterns have been shown to exist in diseases, including various types of cancer. DNA methylation analysis promises to become a powerful tool in cancer diagnosis, treatment and prognostication. Large-scale methylation arrays are now available for studying methylation genome-wide. The Illumina methylation platform simultaneously measures cytosine methylation at more than 1500 CpG sites associated with over 800 cancer-related genes. Cluster analysis is often used to identify DNA methylation subgroups for prognosis and diagnosis. However, because of the unique non-Gaussian characteristics of DNA methylation data, traditional clustering methods may not be appropriate, and determining the optimal number of clusters remains problematic.

Method: A Dirichlet process beta mixture model (DPBMM) is proposed that models DNA methylation levels as an infinite mixture of beta distributions. The model allows automatic learning of the relevant parameters, such as the cluster mixing proportions, the parameters of the beta distribution for each cluster, and especially the number of potential clusters. Since the model is high dimensional and analytically intractable, we propose a Gibbs sampling "no-gaps" solution for computing the posterior distributions and, from them, the estimates of the parameters.
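
To make the idea of a Dirichlet process beta mixture more concrete, the following is a minimal illustrative sketch and not the authors' "no-gaps" Gibbs sampler. It assumes a truncated stick-breaking construction for the mixing proportions and a single methylation value in (0, 1); the function names, the truncation level, and the example beta parameters are all hypothetical choices made for illustration.

```python
import math
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixing proportions from a truncated stick-breaking construction.

    alpha is the Dirichlet process concentration parameter; truncation is an
    assumed finite cap on the number of components (illustration only).
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the leftover stick
    return weights


def beta_log_pdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def responsibilities(x, weights, params):
    """Posterior cluster probabilities for one methylation value x.

    params is a list of (a, b) beta parameters, one pair per cluster.
    """
    log_terms = [
        math.log(w) + beta_log_pdf(x, a, b)
        for w, (a, b) in zip(weights, params)
    ]
    m = max(log_terms)  # subtract the max for numerical stability
    unnorm = [math.exp(t - m) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical example: one low-methylation and one high-methylation cluster.
rng = random.Random(0)
example_weights = stick_breaking_weights(alpha=1.0, truncation=10, rng=rng)
example_resp = responsibilities(0.1, [0.5, 0.5], [(2.0, 8.0), (8.0, 2.0)])
```

In a full sampler, the cluster assignments, the beta parameters, and the mixing proportions would be resampled in turn; the sketch only shows how a beta mixture assigns a value to clusters once those quantities are fixed.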

Results: The proposed algorithm was tested on simulated data as well as methylation data from 55 glioblastoma multiforme (GBM) brain tissue samples. To reduce the computational burden due to the high data dimensionality, a dimension reduction method is adopted. The two GBM clusters yielded by DPBMM are based on data with different numbers of loci (P-value < 0.1), whereas hierarchical clustering does not yield statistically significant clusters.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Bayes Theorem
  • Brain Neoplasms / genetics
  • Brain Neoplasms / metabolism
  • Brain Neoplasms / pathology
  • Cluster Analysis
  • CpG Islands
  • DNA / chemistry*
  • DNA Methylation*
  • Glioblastoma / genetics
  • Glioblastoma / metabolism
  • Glioblastoma / pathology
  • Humans
  • Models, Genetic
  • Sulfates / chemistry*

Substances

  • Sulfates
  • DNA