Upscaling human activity data: A statistical ecology approach

PLoS One. 2021 Jul 1;16(7):e0253461. doi: 10.1371/journal.pone.0253461. eCollection 2021.

Abstract

Big data call for new techniques to extract the information they contain. Here we consider four datasets (email communication, Twitter posts, Wikipedia articles and Gutenberg books) and propose a novel statistical framework to predict global statistics from random samples. More precisely, from a small sample of sent emails per sender, posts per hashtag and word occurrences, we infer the number of senders, hashtags and words in the whole dataset and how their abundances (e.g. the popularity of a hashtag) change across scales. Our approach is grounded in statistical ecology, as we map the inference of human activities onto the unseen species problem in biodiversity. Our findings may have applications to resource management in email systems, collective attention monitoring on Twitter and language learning processes in word databases.
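
To make the unseen species analogy concrete, the sketch below treats each hashtag (or sender, or word) as a "species" and each occurrence in the sample as an observed individual. It uses the classical bias-corrected Chao1 estimator as a stand-in; this is not the framework proposed in the paper, which the abstract does not specify, and the function name `chao1` and the toy sample are illustrative assumptions.

```python
from collections import Counter


def chao1(sample):
    """Bias-corrected Chao1 lower-bound estimate of the total number of types.

    Illustrative stand-in only: a classical unseen-species estimator,
    not the statistical framework proposed in the paper.
    """
    # Abundance of each observed type (e.g. posts per hashtag).
    abundances = Counter(sample)
    # How many types were seen exactly once, exactly twice, and so on.
    freq_of_freq = Counter(abundances.values())
    s_obs = len(abundances)       # types observed in the sample
    f1 = freq_of_freq.get(1, 0)   # singletons
    f2 = freq_of_freq.get(2, 0)   # doubletons
    # Many singletons relative to doubletons suggest many unseen types.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


if __name__ == "__main__":
    # Hypothetical sample of hashtags drawn from a larger stream.
    hashtags = ["#a", "#a", "#b", "#c", "#d", "#d"]
    print(chao1(hashtags))
```

The estimate never falls below the number of types actually observed, and it equals that number when the sample contains no singletons. Unlike the approach described in the abstract, this simple estimator says nothing about how abundances change across scales.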

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Big Data*
  • Computer Communication Networks / statistics & numerical data*
  • Datasets as Topic*
  • Electronic Mail / statistics & numerical data*
  • Humans
  • Social Media / statistics & numerical data*

Grants and funding

A. Tovo acknowledges financial support from the neXt grant, Department of Mathematics “Tullio Levi-Civita” of the University of Padova. S. Suweis and A. Tovo acknowledge the STARS grant 2019 from the University of Padova. S. Stivanello acknowledges financial support from Progetto Dottorati - Fondazione Cassa di Risparmio di Padova e Rovigo. A. Tovo and A. Maritan acknowledge support from the University of Padova through the “Excellence Project 2018” of the Cariparo foundation. S. Favaro received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257. S. Favaro gratefully acknowledges financial support from the Italian Ministry of Education, University and Research (MIUR), “Dipartimenti di Eccellenza” grant 2018-2022. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.