Permutation tests are robust and powerful at 0.5% and 5% significance levels

Kimihiro Noguchi; Frank Konietschke; Fernando Marmolejo-Ramos; Markus Pauly

doi:10.3758/s13428-021-01595-5

Behav Res Methods. 2021 Dec;53(6):2712-2724. doi: 10.3758/s13428-021-01595-5. Epub 2021 May 28.

Authors

Kimihiro Noguchi¹, Frank Konietschke²,³, Fernando Marmolejo-Ramos⁴, Markus Pauly⁵

Affiliations

¹ Department of Mathematics, Western Washington University, Bellingham, WA, 98225, USA. Kimihiro.Noguchi@wwu.edu.
² Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany.
³ Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, Berlin, 10178, Germany.
⁴ Centre for Change and Complexity in Learning, University of South Australia, Adelaide, South Australia, 5005, Australia.
⁵ Department of Statistics, TU Dortmund University, Dortmund, 44227, Germany.

PMID: 34050436
DOI: 10.3758/s13428-021-01595-5

Abstract

The recent replication crisis has led to a number of ad hoc suggestions for reducing the chance of false positive findings. Among them, Johnson (Proceedings of the National Academy of Sciences, 110, 19313-19317, 2013) and Benjamin et al. (Nature Human Behaviour, 2, 6-10, 2018) recommend using a significance level of α = 0.005 (0.5%) instead of the conventional 0.05 (5%) level. Although their suggestion is easy to implement, it is unclear whether commonly used statistical tests are robust and/or powerful at such a small significance level. The main aim of our study is therefore to investigate the robustness and power curve behaviors of independent (unpaired) two-sample tests for metric and ordinal data at nominal significance levels of α = 0.005 and α = 0.05. Through an extensive simulation study, we find that the permutation versions of the Welch t-test and the Brunner-Munzel test are particularly robust and powerful, whereas the commonly used two-sample tests that rely on the t-distribution tend to be either liberal or conservative and show peculiar power curve behaviors under skewed distributions with variance heterogeneity.
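To make the idea of a permutation version of the Welch t-test concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names `welch_t` and `permutation_welch_test`, the number of permutations, the fixed seed, and the "+1" correction in the p-value are all assumptions chosen for illustration. The sketch computes the Welch statistic on the observed samples, recomputes it on random relabelings of the pooled data, and reports the share of relabelings that are at least as extreme.

```python
import numpy as np


def welch_t(x, y):
    """Welch t-statistic for two independent samples (unequal variances)."""
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    return (np.mean(x) - np.mean(y)) / np.sqrt(vx / nx + vy / ny)


def permutation_welch_test(x, y, n_perm=10000, seed=0):
    """Two-sided permutation test based on the Welch t-statistic.

    Illustrative sketch only: n_perm, seed, and the (count + 1) / (n_perm + 1)
    p-value convention are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    nx = len(x)
    t_obs = welch_t(x, y)

    count = 0
    for _ in range(n_perm):
        # Randomly reassign group labels by shuffling the pooled sample.
        perm = rng.permutation(pooled)
        t_perm = welch_t(perm[:nx], perm[nx:])
        if abs(t_perm) >= abs(t_obs):
            count += 1

    # Count the observed arrangement itself so the p-value is never zero.
    p_value = (count + 1) / (n_perm + 1)
    return t_obs, p_value


if __name__ == "__main__":
    # Hypothetical skewed samples with unequal variances and sample sizes.
    rng = np.random.default_rng(1)
    a = rng.exponential(scale=1.0, size=15)
    b = rng.exponential(scale=3.0, size=25)
    t, p = permutation_welch_test(a, b, n_perm=2000)
    print(f"t = {t:.3f}, permutation p = {p:.4f}")
```

With this p-value convention the smallest attainable value is 1/(n_perm + 1), so testing at α = 0.005 requires considerably more than 200 permutations for a rejection to be possible at all.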

Keywords: Nonparametric tests; Permutation tests; Replication crisis; Reproducibility issue; Robust statistics; Statistical evidence; Statistical significance.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computer Simulation
False Positive Reactions*
Humans
Models, Statistical*
Probability
Statistical Distributions*