An evaluation of statistical methods for DNA methylation microarray data analysis

BMC Bioinformatics. 2015 Jul 10;16:217. doi: 10.1186/s12859-015-0641-x.

Abstract

Background: DNA methylation offers an excellent example for elucidating how epigenetic information affects gene expression. β values and M values are commonly used to quantify DNA methylation. Statistical methods applicable to DNA methylation data analysis include the Wilcoxon rank sum test, the t-test, the Kolmogorov-Smirnov test, the permutation test, the empirical Bayes method, and the bump hunting method. Nonetheless, selecting an optimal statistical method can be challenging when different methods generate inconsistent results from the same data set.
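As a brief illustration of the two quantities named above, the M value is conventionally defined as the log2 odds of the β value, M = log2(β / (1 − β)). The sketch below shows that conversion in both directions. The function names and the clamping constant `eps` are illustrative assumptions, not taken from the paper.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M value (log2 odds)."""
    # Clamp away from 0 and 1 so the log stays finite.
    # eps is an assumed, illustrative constant, not a value from the paper.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M value back to a beta value in (0, 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A β value of 0.5 maps to an M value of 0, and the transform stretches values near 0 and 1, which is why the two scales can behave differently under the same test.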

Results: We compared six statistical approaches relevant to DNA methylation microarray analysis in terms of false discovery rate (FDR) control, statistical power, and stability through simulation studies and real data examples. Differences between β values and M values were observed only when methylation levels were correlated across CpG loci. For small sample sizes (n=3 or 6 in each group), both the empirical Bayes and bump hunting methods showed appropriate FDR control and the highest power when methylation levels across CpG loci were independent. Only the bump hunting method showed appropriate FDR control and the highest power when methylation levels across CpG loci were correlated. For medium (n=12 in each group) and large (n=24 in each group) sample sizes, all methods compared had similar power, except for the permutation test whenever the proportion of differentially methylated loci was low. For all sample sizes, the bump hunting method had the lowest stability, measured as the standard deviation of total discoveries, whenever the proportion of differentially methylated loci was large. The apparent test power comparisons based on raw p-values from DNA methylation studies on ovarian cancer and rheumatoid arthritis were consistent with the results of the simulation studies. Overall, these results provide guidance for selecting optimal statistical methods under different scenarios.
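To make the evaluation criteria concrete, here is a minimal sketch of how empirical FDR and power could be computed in a simulation where the truly differentially methylated loci are known. The abstract does not say which multiple-testing procedure was used, so the Benjamini-Hochberg step-up procedure is an assumption here, and all function and parameter names are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg is used only as a familiar example of
    FDR control; the paper's own procedure is not stated in the abstract.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def empirical_fdr_and_power(rejected, is_true_dml):
    """Compare discoveries with the simulated truth for one data set."""
    discoveries = sum(rejected)
    true_pos = sum(1 for r, t in zip(rejected, is_true_dml) if r and t)
    n_true = sum(is_true_dml)
    # With no discoveries the false discovery proportion is taken as 0.
    fdr = (discoveries - true_pos) / discoveries if discoveries else 0.0
    power = true_pos / n_true if n_true else 0.0
    return fdr, power
```

Averaging these two quantities over many simulated data sets gives estimates of FDR and power, and the standard deviation of the number of discoveries across data sets corresponds to the stability measure described above.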

Conclusions: For DNA methylation studies with small sample size, the bump hunting method and the empirical Bayes method are recommended when DNA methylation levels across CpG loci are independent, while only the bump hunting method is recommended when DNA methylation levels are correlated across CpG loci. All methods are acceptable for medium or large sample sizes.

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Arthritis, Rheumatoid / genetics*
  • Bayes Theorem
  • Computer Simulation*
  • DNA Methylation*
  • Female
  • Humans
  • Models, Statistical*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Ovarian Neoplasms / genetics*
  • Sample Size
  • Statistics, Nonparametric