Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys

PLoS One. 2020 Apr 22;15(4):e0231500. doi: 10.1371/journal.pone.0231500. eCollection 2020.

Abstract

Modern survey methods may be subject to non-observable bias from various sources. In online surveys, for example, selection bias is prevalent because of the sampling mechanism commonly used, whereby participants self-select from a subgroup whose characteristics differ from those of the target population. Several techniques have been proposed to tackle this issue. One of them is Propensity Score Adjustment (PSA), which is widely used and has been analysed in various studies. The usual method of estimating the propensity score is logistic regression, which requires a reference probability sample in addition to the online nonprobability sample. The predicted propensities can then be used for reweighting with various estimators. However, in the online survey context, there are alternatives that might outperform logistic regression for propensity estimation. The aim of the present study is to determine the efficiency of some of these alternatives, which involve Machine Learning (ML) classification algorithms. PSA is applied in two simulation scenarios, representing situations commonly found in online surveys, using logistic regression and ML models for propensity estimation. The results show that ML algorithms remove selection bias more effectively than logistic regression when used for PSA, but that their efficacy depends largely on the selection mechanism and the dimensionality of the data.
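The logistic-regression baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the population, the selection mechanism (self-selection odds proportional to exp(0.8x)), the function names, and the choice of inverse-odds weights (1 - p)/p as the reweighting estimator are all assumptions made for illustration, and the ML classifiers compared in the paper are not shown. The volunteer and reference samples are stacked, a model is fitted to predict membership of the volunteer sample, and the predicted propensities reweight the volunteers.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, zs, lr=0.5, epochs=400):
    """Fit P(z=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, z in zip(xs, zs):
            err = sigmoid(b0 + b1 * x) - z
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def run_demo(seed=0, pop_size=20000, n=800):
    rng = random.Random(seed)
    # Hypothetical population: one covariate x and an outcome y that depends on it.
    population = []
    for _ in range(pop_size):
        x = rng.gauss(0.0, 1.0)
        population.append((x, 2.0 + x + rng.gauss(0.0, 1.0)))
    true_mean = sum(y for _, y in population) / pop_size

    # Reference probability sample: simple random sample without replacement.
    reference = rng.sample(population, n)
    # Volunteer sample: units with larger x are more likely to self-select.
    # Drawn with replacement for simplicity (an assumption of this sketch).
    sel_weights = [math.exp(0.8 * x) for x, _ in population]
    volunteers = rng.choices(population, weights=sel_weights, k=n)

    # Stack both samples; z = 1 marks volunteers, z = 0 marks reference units.
    xs = [x for x, _ in volunteers] + [x for x, _ in reference]
    zs = [1] * n + [0] * n
    b0, b1 = fit_logistic(xs, zs)

    # Estimated propensities for volunteers, turned into inverse-odds weights.
    weights = []
    for x, _ in volunteers:
        p = sigmoid(b0 + b1 * x)
        weights.append((1.0 - p) / p)

    naive = sum(y for _, y in volunteers) / n
    adjusted = sum(w * y for w, (_, y) in zip(weights, volunteers)) / sum(weights)
    return true_mean, naive, adjusted


true_mean, naive, adjusted = run_demo()
print(f"true={true_mean:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this toy setting the selection odds are log-linear in x, so logistic regression is correctly specified and the weighted mean should land much closer to the population mean than the unweighted volunteer mean. Replacing `fit_logistic` with another classifier that outputs membership probabilities would give the kind of ML-based variant the study evaluates.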

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computer Simulation / statistics & numerical data
  • Data Interpretation, Statistical
  • Humans
  • Logistic Models
  • Machine Learning / statistics & numerical data*
  • Propensity Score
  • Research Design / statistics & numerical data
  • Selection Bias
  • Surveys and Questionnaires / statistics & numerical data*

Grants and funding

This study was partially supported by the Ministerio de Economía y Competitividad, Spain [grant number MTM2015-63609-R] and, for the first author, by an FPU grant from the Ministerio de Ciencia, Innovación y Universidades, Spain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.