Moment based gene set tests

BMC Bioinformatics. 2015 Apr 28:16:132. doi: 10.1186/s12859-015-0571-7.

Abstract

Background: Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high-throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests.
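To make the 1/(M+1) floor concrete, the following is a minimal sketch rather than anything taken from the paper or its software. The function names permutation_p_value and min_permutations are hypothetical, and the use of a Bonferroni correction to account for simultaneous tests is an assumption chosen for illustration.

```python
import math


def permutation_p_value(observed, permuted_stats):
    """One-sided permutation p-value with the usual +1 correction.

    The observed statistic counts as one of the M + 1 equally likely
    outcomes, so the result can never fall below 1 / (M + 1).
    """
    m = len(permuted_stats)
    at_least_as_extreme = sum(1 for s in permuted_stats if s >= observed)
    return (1 + at_least_as_extreme) / (m + 1)


def min_permutations(alpha, n_tests):
    """Smallest M for which 1 / (M + 1) <= alpha / n_tests.

    Assumes a Bonferroni correction (an illustrative choice): a single
    test can only be declared significant if its p-value can reach
    alpha / n_tests.
    """
    return math.ceil(n_tests / alpha) - 1
```

Under these assumptions, testing 1,000 gene sets at a family-wise level of 0.05 requires on the order of 20,000 permutations per gene set before any single test can reach significance, which is the cost the Background paragraph describes.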

Results: We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing |G| permutations, where |G| is the number of genes in set G. For the quadratic statistics, the cost is on the order of |G|^2 permutations, which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared.
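The idea of replacing sampled permutations with exact moments can be sketched for a linear statistic. The example below is an illustration under stated assumptions, not the npGSEA implementation: it assumes the statistic has the form T = sum_i a_i * b_i, where a_i is a per-sample score (here the sum of a gene set's expression values) and b_i is the outcome, and it fits a normal distribution, which is only one possible parametric choice. All function names are hypothetical. The mean and variance used are the standard exact moments of such a sum when b is uniformly permuted.

```python
import math


def gene_set_scores(expr_rows):
    """Collapse a gene set to one score per sample by summing over genes.

    expr_rows holds one list per gene, each with one value per sample.
    """
    return [sum(column) for column in zip(*expr_rows)]


def linear_perm_moments(a, b):
    """Exact permutation mean and variance of T = sum_i a[i] * b[pi(i)].

    For a uniformly random permutation pi of n samples:
      E[T]   = n * mean(a) * mean(b)
      Var[T] = SS(a) * SS(b) / (n - 1)
    where SS denotes the sum of squared deviations from the mean.
    """
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    ss_a = sum((x - a_bar) ** 2 for x in a)
    ss_b = sum((y - b_bar) ** 2 for y in b)
    return n * a_bar * b_bar, ss_a * ss_b / (n - 1)


def normal_upper_p(t_obs, mean, var):
    """Upper-tail p-value from a normal fit to the permutation moments.

    The normal family is an assumption made for this sketch.
    """
    z = (t_obs - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Computing the two moments takes a single pass over the data, and the resulting p-value is not bounded below by 1/(M+1), which is why a moment-based fit can reach the small p-values that multiple testing demands.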

Conclusions: We have developed a moment-based approximation to linear and quadratic gene set test statistics' permutation distribution. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations. We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org).

MeSH terms

  • Algorithms*
  • Biomarkers / metabolism*
  • Data Interpretation, Statistical
  • Gene Expression Profiling*
  • Genomics / methods*
  • Humans
  • Models, Statistical*
  • Parkinson Disease / genetics*

Substances

  • Biomarkers