What is a Good Calibration Question?

Risk Anal. 2022 Feb;42(2):264-278. doi: 10.1111/risa.13725. Epub 2021 Apr 16.

Abstract

Weighted aggregation of expert judgments based on their performance on calibration questions may improve mathematically aggregated judgments relative to equal weights. However, obtaining validated, relevant calibration questions can be difficult. When it is, should analysts settle for equal weights? Or should they use calibration questions that are easier to obtain but less relevant? In this article, we examine what happens to the out-of-sample performance of weighted aggregations under the classical model (CM), compared to equally weighted aggregations, when the set of calibration questions includes many so-called "irrelevant" questions: those that would ordinarily be considered outside the domain of the questions of interest. We find that performance-weighted aggregations outperform equal weights on the combined CM score, but not on statistical accuracy (i.e., calibration). Importantly, there was no appreciable difference in performance when weights were developed on relevant versus irrelevant questions. Experts were unable to adapt their knowledge across vastly different domains, and in-sample validation did not accurately predict out-of-sample performance on irrelevant questions. We suggest that if relevant calibration questions cannot be found, analysts should use equal weights and draw on alternative techniques to improve judgments. Our study also indicates limits to the predictive accuracy of performance-weighted aggregation and to the degree to which expertise can be adapted across domains. We note limitations of our study and urge further research into the effect of question type on the reliability of performance-weighted aggregations.
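
For readers unfamiliar with the CM, the sketch below illustrates the performance-weighting idea the abstract refers to: each expert's weight is the product of a calibration (statistical accuracy) score and an information score, computed from calibration questions with known answers, with experts below a cutoff receiving zero weight. The function names, the uniform background measure, and the fixed cutoff alpha are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of Cooke-style (classical model) performance weighting,
# assuming each expert supplies 5%/50%/95% quantiles for calibration
# questions with known realizations. Illustrative only.
import numpy as np
from scipy.stats import chi2
from scipy.special import rel_entr

P_BINS = np.array([0.05, 0.45, 0.45, 0.05])  # theoretical inter-quantile mass

def calibration_score(quantiles, realizations):
    """Statistical accuracy: p-value of a likelihood-ratio (G) test comparing
    observed bin frequencies of the realizations to P_BINS."""
    bins = [int(np.searchsorted(q, r)) for q, r in zip(quantiles, realizations)]
    s = np.bincount(bins, minlength=4) / len(realizations)
    kl = rel_entr(s, P_BINS).sum()            # KL divergence I(s, p)
    return 1.0 - chi2.cdf(2 * len(realizations) * kl, df=3)

def information_score(quantiles, lower, upper):
    """Mean KL divergence of the expert's distribution from a uniform
    background measure on a supplied intrinsic range [lower_i, upper_i]."""
    per_question = []
    for q, a, b in zip(quantiles, lower, upper):
        widths = np.diff(np.concatenate(([a], q, [b]))) / (b - a)
        per_question.append(np.sum(P_BINS * np.log(P_BINS / widths)))
    return float(np.mean(per_question))

def performance_weights(experts, alpha=0.05):
    """Unnormalized weight = calibration x information, zeroed below alpha;
    falls back to equal weights if every expert is cut off."""
    raw = []
    for e in experts:
        cal = calibration_score(e["quantiles"], e["realizations"])
        info = information_score(e["quantiles"], e["lower"], e["upper"])
        raw.append(cal * info if cal >= alpha else 0.0)
    raw = np.array(raw)
    n = len(experts)
    return raw / raw.sum() if raw.sum() > 0 else np.full(n, 1.0 / n)
```

The aggregated "decision maker" distribution for a question of interest is then the weighted mixture of the experts' distributions; equal weighting simply sets every weight to 1/N. The combined CM score mentioned in the abstract is, as usually defined, the product of the decision maker's calibration and information scores.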

Keywords: Aggregation; calibration; equal weights; expert judgment; performance weights.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Calibration
  • Judgment*
  • Reproducibility of Results