Permutation tests are robust and powerful at 0.5% and 5% significance levels

Behav Res Methods. 2021 Dec;53(6):2712-2724. doi: 10.3758/s13428-021-01595-5. Epub 2021 May 28.

Abstract

The recent replication crisis has led to a number of ad hoc suggestions for reducing the chance of false-positive findings. Among them, Johnson (Proceedings of the National Academy of Sciences, 110, 19313-19317, 2013) and Benjamin et al. (Nature Human Behaviour, 2, 6-10, 2018) recommend using a significance level of α = 0.005 (0.5%) instead of the conventional α = 0.05 (5%). Although their suggestion is easy to implement, it is unclear whether commonly used statistical tests are robust and/or powerful at such a small significance level. The main aim of our study is therefore to investigate the robustness and power curve behaviors of independent (unpaired) two-sample tests for metric and ordinal data at nominal significance levels of α = 0.005 and α = 0.05. An extensive simulation study shows that the permutation versions of the Welch t-test and the Brunner-Munzel test are particularly robust and powerful, whereas the commonly used two-sample tests that rely on the t-distribution tend to be either liberal or conservative and have peculiar power curve behaviors under skewed distributions with variance heterogeneity.
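
To illustrate what a permutation version of the Welch t-test can look like, the following is a minimal Python sketch. It is not the authors' code, and the exact permutation scheme used in the study may differ. The function names welch_t and permutation_welch_test, the default of 2000 random permutations, and the add-one correction are assumptions made for this illustration only.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances)."""
    diff = mean(x) - mean(y)
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        # Degenerate case: both groups are constant.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_welch_test(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value using Welch's t.

    Hypothetical helper: group labels are reshuffled n_perm times and the
    statistic is recomputed for each relabelling.
    """
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:n_x], pooled[n_x:])) >= observed:
            hits += 1
    # Add-one correction: counts the observed labelling itself, so the
    # p-value can never be exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Skewed samples with different spreads (illustrative data only).
    a = [data_rng.expovariate(1.0) for _ in range(15)]
    b = [3.0 * data_rng.expovariate(1.0) for _ in range(25)]
    p = permutation_welch_test(a, b)
    print(f"p = {p:.4f}; reject at 0.05: {p < 0.05}; reject at 0.005: {p < 0.005}")
```

With the add-one correction, the smallest attainable p-value is 1/(n_perm + 1), so testing at α = 0.005 requires well over 200 permutations; the default of 2000 gives a floor of about 0.0005.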

Keywords: Nonparametric tests; Permutation tests; Replication crisis; Reproducibility issue; Robust statistics; Statistical evidence; Statistical significance.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computer Simulation
  • False Positive Reactions*
  • Humans
  • Models, Statistical*
  • Probability
  • Statistical Distributions*