Using synthetic data to improve the reproducibility of statistical results in psychological research

Psychol Methods. 2022 Aug 4. doi: 10.1037/met0000526. Online ahead of print.

Abstract

In recent years, psychological research has faced a credibility crisis, and open data are often regarded as an important step toward a more reproducible psychological science. However, privacy concerns are among the main reasons that prevent data sharing. Synthetic data procedures, which are based on the multiple imputation (MI) approach to missing data, can be used to replace sensitive data with simulated values, which can be analyzed in place of the original data. One crucial requirement of this approach is that the synthesis model is correctly specified. In this article, we investigated the statistical properties of synthetic data with a particular emphasis on the reproducibility of statistical results. To this end, we compared conventional approaches to synthetic data based on MI with a data-augmented approach (DA-MI) that attempts to combine the advantages of masking methods and synthetic data, thus making the procedure more robust to misspecification. In multiple simulation studies, we found that the good properties of the MI approach strongly depend on the correct specification of the synthesis model, whereas the DA-MI approach can provide useful results even under various types of misspecification. This suggests that the DA-MI approach to synthetic data can provide an important tool that can be used to facilitate data sharing and improve reproducibility in psychological research. In a working example, we also demonstrate the implementation of these approaches in widely available software, and we provide recommendations for practice. (PsycInfo Database Record (c) 2022 APA, all rights reserved).