Describing the Pearson R distribution of aggregate data

David J Torres

doi:10.1515/mcma-2020-2054

Monte Carlo Methods Appl. 2020 Mar;26(1):17-32. doi: 10.1515/mcma-2020-2054. Epub 2020 Feb 5.

Author

David J Torres¹

Affiliation

¹ Department of Mathematics and Physical Science, Northern New Mexico College, Española, NM, USA.

Abstract

Ecological and epidemiological studies often must use group-averaged data to make inferences about individual patterns. However, using correlations based on averages to estimate correlations of individual scores is subject to an "ecological fallacy". The purpose of this article is to create distributions of Pearson R correlation values computed from group-averaged (aggregate) data using Monte Carlo simulations and random sampling. We show that, as the group size increases, the distributions can be approximated by a generalized hypergeometric distribution. The expectation of the constructed distribution slightly underestimates the individual Pearson R value, but the difference becomes smaller as the number of groups increases. The approximately normal distribution resulting from Fisher's transformation can be used to build confidence intervals that estimate the Pearson R value of the individual scores from the Pearson R value of the aggregated scores.
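The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the article's code: the bivariate normal population, the correlation `rho = 0.6`, the group counts and sizes, and the function names `simulate_aggregate_r` and `fisher_interval` are all assumptions chosen for illustration. The sketch randomly assigns individuals to groups, averages each group, and collects the Pearson R of the group averages over many trials. It then applies the standard Fisher transformation, with standard error 1/sqrt(G - 3) for G groups, to form an approximate 95% confidence interval from a single aggregate correlation.

```python
import numpy as np


def simulate_aggregate_r(x, y, n_groups, group_size, n_trials, rng):
    """Return Pearson r values computed from randomly formed group averages."""
    n_used = n_groups * group_size
    r_values = np.empty(n_trials)
    for t in range(n_trials):
        # Sample individuals without replacement and split them into groups.
        idx = rng.permutation(x.size)[:n_used].reshape(n_groups, group_size)
        x_means = x[idx].mean(axis=1)
        y_means = y[idx].mean(axis=1)
        r_values[t] = np.corrcoef(x_means, y_means)[0, 1]
    return r_values


def fisher_interval(r, n_groups, z_crit=1.959964):
    """Approximate 95% interval from Fisher's z = arctanh(r)."""
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(n_groups - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)


rng = np.random.default_rng(0)

# Hypothetical population of individual scores (assumed, not from the article).
rho = 0.6
population = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000
)
x, y = population[:, 0], population[:, 1]
r_individual = np.corrcoef(x, y)[0, 1]

# Distribution of the aggregate Pearson r over repeated random groupings.
n_groups = 30
r_agg = simulate_aggregate_r(
    x, y, n_groups=n_groups, group_size=20, n_trials=2000, rng=rng
)

# Interval for the individual-level r built from one aggregate r.
low, high = fisher_interval(r_agg[0], n_groups)

print(f"individual r:       {r_individual:.3f}")
print(f"mean aggregate r:   {r_agg.mean():.3f}")
print(f"one aggregate r:    {r_agg[0]:.3f}")
print(f"Fisher 95% interval: ({low:.3f}, {high:.3f})")
```

Because groups are formed at random here, the group averages keep the same underlying correlation as the individual scores, which is the setting in which an interval of this kind is meaningful. Grouping on a variable related to the scores would generally change the aggregate correlation and is outside the scope of this sketch.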

Keywords: Monte Carlo simulations; Pearson R correlation; aggregate data; confidence intervals.

Grants and funding

P20 GM103451/GM/NIGMS NIH HHS/United States