Humans in the Loop: Incorporating Expert and Crowd-Sourced Knowledge for Predictions Using Survey Data

Anna Filippova; Connor Gilroy; Ridhi Kashyap; Antje Kirchner; Allison C Morgan; Kivan Polimis; Adaner Usmani; Tong Wang

doi:10.1177/2378023118820157

Socius. 2019 Jan-Dec;5:10.1177/2378023118820157. doi: 10.1177/2378023118820157. Epub 2019 Sep 10.

Authors

Anna Filippova¹, Connor Gilroy², Ridhi Kashyap³, Antje Kirchner⁴ ⁵, Allison C Morgan⁶, Kivan Polimis⁷, Adaner Usmani⁸, Tong Wang⁹

Affiliations

¹ GitHub, Carnegie Mellon University, San Francisco, CA, USA.
² University of Washington, Seattle, WA, USA.
³ University of Oxford, Oxford, UK.
⁴ RTI International, Research Triangle Park, NC, USA.
⁵ University of Nebraska-Lincoln, Lincoln, NE, USA.
⁶ University of Colorado, Boulder, CO, USA.
⁷ Dondena Centre, Università Bocconi, Bocconi Institute for Data Science and Analytics, Milan, Italy.
⁸ Brown University, Providence, RI, USA.
⁹ University of Iowa, Iowa City, IA, USA.

Abstract

Survey data sets are often wider than they are long. This high ratio of variables to observations raises concerns about overfitting during prediction, making informed variable selection important. Recent applications in computer science have sought to incorporate human knowledge into machine-learning methods to address these problems. The authors implement such a "human-in-the-loop" approach in the Fragile Families Challenge. The authors use surveys to elicit knowledge from experts and laypeople about the importance of different variables to different outcomes. This strategy offers the option to subset the data before prediction or to incorporate human knowledge as scores in prediction models, or both together. The authors find that human intervention is not obviously helpful. Human-informed subsetting reduces predictive performance, and considered alone, approaches incorporating scores perform marginally worse than approaches that do not. However, incorporating human knowledge may still improve predictive performance, and future research should consider new ways of doing so.
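
The two options named in the abstract, subsetting the data before prediction and carrying human knowledge into a model as scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold, the sum-to-one rescaling, and the toy variables and scores are all hypothetical and chosen only to show the idea.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def subset_variables(
    rows: List[Row], scores: Dict[str, float], threshold: float
) -> Tuple[List[Row], List[str]]:
    """Keep only variables whose human importance score is >= threshold.

    `scores` is a hypothetical mapping from variable name to an
    importance rating elicited from experts or laypeople.
    """
    kept = sorted(name for name, score in scores.items() if score >= threshold)
    subset = [{name: row[name] for name in kept if name in row} for row in rows]
    return subset, kept


def scores_to_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to sum to 1 so a model could use them as variable weights.

    The rescaling rule is an assumption made for illustration only.
    """
    total = sum(scores.values())
    if total == 0:
        # No information in the scores: fall back to equal weights.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Made-up toy data: two respondents, three survey variables.
rows = [
    {"income": 1.0, "shoe_size": 9.0, "parent_education": 12.0},
    {"income": 2.5, "shoe_size": 7.0, "parent_education": 16.0},
]
# Made-up human importance scores on a 0-1 scale.
scores = {"income": 0.9, "shoe_size": 0.1, "parent_education": 0.7}

subset, kept = subset_variables(rows, scores, threshold=0.5)
weights = scores_to_weights(scores)
```

In this sketch, subsetting discards low-scored variables entirely, whereas the weights keep every variable and leave it to the downstream model to decide how much the human scores should matter.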

Keywords: Fragile Families Challenge; machine learning; missing data; prediction; surveys.

Grants and funding

P2C HD042828/HD/NICHD NIH HHS/United States