A methodology for preprocessing structured big data in the behavioral sciences

Behav Res Methods. 2023 Jun;55(4):1818-1838. doi: 10.3758/s13428-022-01895-4. Epub 2022 Jun 29.

Abstract

The characteristics of big data, including high volume, increased variety, and velocity, pose special challenges for data analysis. Because these characteristics generally preclude manual data inspection and processing, researchers must often rely on computational methodologies to handle this type of data, techniques that may be unfamiliar to nonspecialists, including behavioral scientists. However, existing data analytics methodologies from the field of computer science, developed to handle the generic tasks of data collection, preprocessing, and analysis, can be adapted for use in other disciplines. These methodologies involve a sequential pipeline of quality checks that prepare data sets for analysis and application. Building upon these methodologies, this paper describes the Big Data Quality & Statistical Assurance (BDQSA) model, intended for researchers in the behavioral sciences. The model comprises a series of data preprocessing tasks (data understanding, screening, cleaning, and transformation), followed by a statistical quality phase that includes extraction of the relevant data subset, type conversions, ensuring sample representativeness when appropriate, and assessing statistical assumptions. The resulting model thereby provides methodological guidance for the preprocessing of behavioral science big data, aimed at ensuring acceptable data quality before analysis is undertaken. Sample R code snippets demonstrating the application of this model are provided throughout the paper.
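
As a rough illustration of the kind of sequential pipeline the abstract describes, the following is a minimal sketch in Python using only the standard library. It is not taken from the paper, whose own snippets are in R, and it does not claim to reproduce the BDQSA model. The records, field names, the assumed 1-5 score range, and all function names are hypothetical, and the representativeness and statistical-assumption checks are omitted.

```python
# Hypothetical records; every field name and value here is invented for illustration.
raw = [
    {"id": "1", "group": " Control", "score": "4.0"},
    {"id": "2", "group": "treatment", "score": ""},      # missing score
    {"id": "3", "group": "Treatment ", "score": "9.9"},  # outside the assumed 1-5 scale
    {"id": "3", "group": "Treatment ", "score": "9.9"},  # exact duplicate
    {"id": "4", "group": "TREATMENT", "score": "2.0"},
    {"id": "5", "group": "control", "score": "5.0"},
]

def understand(rows):
    """Data understanding: count rows and empty values per field."""
    fields = rows[0].keys()
    missing = {f: sum(1 for r in rows if r[f] == "") for f in fields}
    return {"n_rows": len(rows), "missing": missing}

def screen_and_clean(rows, lo=1.0, hi=5.0):
    """Screening and cleaning: drop duplicates, incomplete rows, out-of-range scores."""
    seen, kept = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        if any(v == "" for v in r.values()):
            continue  # incomplete record
        if not lo <= float(r["score"]) <= hi:
            continue  # implausible value for the assumed scale
        kept.append(r)
    return kept

def transform(rows):
    """Transformation: normalize inconsistent group labels."""
    return [{**r, "group": r["group"].strip().lower()} for r in rows]

def extract_subset(rows, group):
    """Statistical quality phase: keep only the subset relevant to the analysis."""
    return [r for r in rows if r["group"] == group]

def convert_types(rows):
    """Statistical quality phase: convert string fields to numeric types."""
    return [{"id": int(r["id"]), "group": r["group"], "score": float(r["score"])}
            for r in rows]

summary = understand(raw)
result = convert_types(extract_subset(transform(screen_and_clean(raw)), "control"))
```

Each stage is a separate function, so a check can be inspected, reordered, or replaced on its own. That separation is the main point of the sketch; the specific rules applied at each stage are placeholders.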

Keywords: Behavioral science research; Behavioral sciences; Big data; Data preprocessing; Personality big data.

MeSH terms

  • Big Data*
  • Computers*
  • Data Accuracy
  • Data Collection
  • Humans
  • Research Personnel