Predicting data saturation in qualitative surveys with mathematical models from ecological research

J Clin Epidemiol. 2017 Feb:82:71-78.e2. doi: 10.1016/j.jclinepi.2016.10.001. Epub 2016 Oct 24.

Abstract

Objective: Sample size in surveys with open-ended questions relies on the principle of data saturation. Determining the point of data saturation is difficult because researchers only have information on what they have already found, so the decision to stop data collection rests solely on their judgment and experience. In this article, we show how mathematical modeling may be used to describe and extrapolate the accumulation of themes during a study to help researchers determine the point of data saturation.

Study design and setting: The model considers a latent distribution of the probability of elicitation of all themes and infers the accumulation of themes as arising from a mixture of zero-truncated binomial distributions. We illustrate how the model could be used with data from a survey with open-ended questions on the burden of treatment involving 1,053 participants from 34 different countries and with various conditions. The performance of the model in predicting the number of themes to be found with the inclusion of new participants was investigated by Monte Carlo simulations. Then, we tested how the slope of the expected theme accumulation curve could be used as a stopping criterion for data collection in surveys with open-ended questions.
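Illustrative sketch (not the authors' implementation): the idea of a theme accumulation curve can be made concrete with a small simulation. The sketch assumes each theme i has a fixed elicitation probability p_i and that participants mention themes independently, in which case the expected number of distinct themes after n participants is the sum over themes of 1 - (1 - p_i)^n. The Beta(0.5, 5) distribution, the count of 100 themes, and the function names are hypothetical choices made for illustration only. In a real study the probabilities are latent and would have to be estimated from the observed data, which is the part the mixture model in the abstract addresses and which this sketch does not attempt.

```python
import random


def expected_themes(probs, n):
    """Expected number of distinct themes elicited by n participants,
    assuming independent elicitation with per-theme probabilities `probs`."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def simulate_accumulation(probs, n, rng):
    """Simulate one study and return the cumulative number of distinct
    themes observed after each of the n participants."""
    seen = set()
    curve = []
    for _ in range(n):
        for theme, p in enumerate(probs):
            # Each participant mentions each theme independently.
            if rng.random() < p:
                seen.add(theme)
        curve.append(len(seen))
    return curve


rng = random.Random(0)
# Hypothetical latent elicitation probabilities (NOT taken from the study):
# most themes are rare, a few are common.
probs = [rng.betavariate(0.5, 5.0) for _ in range(100)]

# One simulated study of 200 participants, to compare with the expectation.
curve = simulate_accumulation(probs, 200, rng)
for n in (25, 50, 100, 200):
    print(n, curve[n - 1], round(expected_themes(probs, n), 1))
```

Repeating the simulation many times and averaging the curves gives a Monte Carlo estimate that should approach expected_themes, which is one way to check an extrapolation when the sample size is doubled.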

Results: By doubling the sample size after the inclusion of initial samples of 25 to 200 participants, the model reliably predicted the number of themes to be found. Mean estimation error ranged from 3% to 1% with simulated data and was <2% with data from the study of the burden of treatment. Sequentially calculating the slope of the expected theme accumulation curve for every five new participants included was a feasible approach to balance the benefits of including these new participants in the study. In our simulations, a stopping criterion based on a value of 0.05 for this slope allowed for identifying 97.5% of the themes while limiting the inclusion of participants eliciting nothing new in the study.
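Illustrative sketch (not the authors' implementation): a slope-based stopping rule of the kind described above could look as follows. It assumes the slope is measured as the expected number of new themes contributed by one additional participant, which under independent elicitation is the sum over themes of p_i (1 - p_i)^n. This definition, the function names, and the max_n safeguard are assumptions made for illustration. The sketch also treats the elicitation probabilities as known, whereas in practice they would be estimated from the data collected so far.

```python
def expected_slope(probs, n):
    """Expected number of new themes elicited by participant n + 1,
    given per-theme elicitation probabilities `probs` (assumed known here)."""
    return sum(p * (1.0 - p) ** n for p in probs)


def stopping_point(probs, threshold=0.05, step=5, max_n=10_000):
    """Check the slope after every `step` participants and return the first
    sample size at which it falls below `threshold`.
    Returns None if the threshold is not reached within `max_n` participants."""
    for n in range(step, max_n + 1, step):
        if expected_slope(probs, n) < threshold:
            return n
    return None
```

With the default arguments, the rule is evaluated every five participants against a threshold of 0.05, matching the values quoted in the abstract. The proportion of themes expected to be found at the returned sample size depends entirely on the assumed probabilities.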

Conclusion: Mathematical models adapted from ecological research can accurately predict the point of data saturation in surveys with open-ended questions.

Keywords: Data saturation; Open-ended questions; Qualitative research; Sample size; Surveys and questionnaires; Web-based questionnaires.

MeSH terms

  • Data Collection / statistics & numerical data*
  • Epidemiologic Studies*
  • Humans
  • Models, Theoretical*
  • Qualitative Research*
  • Sample Size