Beyond standard pipeline and p < 0.05 in pathway enrichment analyses

Wentian Li; Andrew Shih; Yun Freudenberg-Hua; Wen Fury; Yaning Yang

doi:10.1016/j.compbiolchem.2021.107455

Comput Biol Chem. 2021 Jun:92:107455. doi: 10.1016/j.compbiolchem.2021.107455. Epub 2021 Feb 12.

Authors

Wentian Li¹, Andrew Shih¹, Yun Freudenberg-Hua², Wen Fury³, Yaning Yang⁴

Affiliations

¹ The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
² Litwin-Zucker Center for the study of Alzheimer's Disease, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA; Division of Geriatric Psychiatry, Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA.
³ Regeneron Pharmaceuticals Inc., Tarrytown, NY, USA.
⁴ Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, China.

Abstract

A standard pathway/gene-set enrichment analysis, the over-representation analysis, is based on four values: the sizes of two gene-sets, the size of their overlap, and the size of the gene universe from which the gene-sets are chosen. The standard result of such an analysis is based on the p-value of a statistical test. We supplement this standard pipeline with six cautions: (1) any p-value threshold used to distinguish enriched gene-sets from non-enriched ones is to a certain degree arbitrary; (2) genes in a gene-set may be correlated, so the nominal gene-set size may overcount the number of independent genes; (3) any attempt to impose multiple testing correction will increase the false negative rate; (4) gene-sets in a gene-set database may be correlated, so the factor used for multiple testing correction may be overcounted; (5) the discrete nature of the data makes it possible for a minimal change in counts to lead to a quantum change in the p-value threshold-based conclusion; (6) the two gene-sets may not be chosen from the universe of all human genes, but in fact from a subset of that universe, or even from two different subsets of all genes. Careful reconsideration of these issues can have an impact on the conclusion of an enrichment analysis. Some of our cautions mirror the call from statisticians that reaching a conclusion from data is not a simple matter of a p-value smaller than 0.05, but a thoughtful process carried out with due diligence.
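
The four-value calculation described above can be sketched in a few lines. This is a minimal illustration that assumes the one-sided hypergeometric (Fisher exact) test, which is a common choice for over-representation analysis; the abstract does not name a specific test. The function name `ora_pvalue` and all counts in the usage loop are hypothetical and chosen only for illustration.

```python
from math import comb


def ora_pvalue(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    universe: number of genes both gene-sets are drawn from (assumed shared)
    set_a, set_b: sizes of the two gene-sets
    overlap: number of genes common to both sets
    """
    smaller = min(set_a, set_b)
    if not (0 <= overlap <= smaller and max(set_a, set_b) <= universe):
        raise ValueError("inconsistent counts")
    # Number of ways to draw set_b genes from the universe.
    total = comb(universe, set_b)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, smaller + 1)
    )
    return tail / total


# Hypothetical counts: a 100-gene pathway and a 200-gene hit list.
# Varying the overlap by one gene, and the universe size, shows how
# sensitive the p-value is to cautions (5) and (6).
for universe in (20000, 10000):
    for overlap in (3, 4):
        p = ora_pvalue(universe, 100, 200, overlap)
        print(f"universe={universe} overlap={overlap} p={p:.4g}")
```

Because the overlap is an integer, the p-value can only take a discrete set of values, so a single additional shared gene may move a result across a fixed threshold. The same counts also give a different p-value when the assumed universe changes, which is why the choice of background gene list deserves explicit justification.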

Keywords: Gene-set enrichment; Human genes; Pathway analysis; Pipelines; Statistical significance.

Publication types

Review

MeSH terms

Algorithms*
Computational Biology*
Databases, Genetic*
Gene Expression Profiling
Humans

Grants and funding

K08 AG054727/AG/NIA NIH HHS/United States