Beyond standard pipeline and p < 0.05 in pathway enrichment analyses

Comput Biol Chem. 2021 Jun:92:107455. doi: 10.1016/j.compbiolchem.2021.107455. Epub 2021 Feb 12.

Abstract

A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of the two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, which potentially overcounts the gene-set size; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, which potentially overcounts the factor for multiple testing correction; (5) the discrete nature of the data makes it possible that a minimal change in counts leads to a quantum change in the p-value threshold-based conclusion; (6) the two gene-sets may not be chosen from the universe of all human genes, but in fact from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can have an impact on the conclusion of an enrichment analysis. Some of our cautions mirror the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value smaller than 0.05, but a thoughtful process with due diligence.
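
To make the four input values and cautions (5) and (6) concrete, the following minimal sketch computes an over-representation p-value from them. It assumes a one-sided hypergeometric test, a common choice for this kind of analysis; the abstract itself does not name a specific test. The function name `ora_p_value` and all counts are hypothetical illustrations, not values from the paper.

```python
from math import comb


def ora_p_value(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    Assumption (not from the paper): gene-set B is treated as a random draw
    of `set_b` genes, without replacement, from a universe of `universe`
    genes, `set_a` of which belong to gene-set A.
    """
    max_overlap = min(set_a, set_b)
    # Number of ways to draw set_b genes from the universe.
    total = comb(universe, set_b)
    # Number of draws that share at least `overlap` genes with gene-set A.
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000-gene universe, gene-sets of 200 and 100 genes.
p_overlap_3 = ora_p_value(20_000, 200, 100, 3)
p_overlap_4 = ora_p_value(20_000, 200, 100, 4)

# Same gene-sets and overlap, but drawn from a 10,000-gene subset instead.
p_small_universe = ora_p_value(10_000, 200, 100, 4)

for label, p in [
    ("overlap = 3, universe = 20,000", p_overlap_3),
    ("overlap = 4, universe = 20,000", p_overlap_4),
    ("overlap = 4, universe = 10,000", p_small_universe),
]:
    verdict = "enriched" if p < 0.05 else "not enriched"
    print(f"{label}: p = {p:.4f} ({verdict} at 0.05)")
```

With these illustrative counts, a single additional shared gene moves the p-value from above 0.05 to below it, and halving the assumed universe moves it back above 0.05 with no change in the gene-sets themselves.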

Keywords: Gene-set enrichment; Human genes; Pathway analysis; Pipelines; Statistical significance.

Publication types

  • Review

MeSH terms

  • Algorithms*
  • Computational Biology*
  • Databases, Genetic*
  • Gene Expression Profiling
  • Humans