Towards a more nuanced conceptualisation of differential examiner stringency in OSCEs

Matt Homer

doi:10.1007/s10459-023-10289-w

Adv Health Sci Educ Theory Pract. 2023 Oct 16. doi: 10.1007/s10459-023-10289-w. Online ahead of print.

Author

Matt Homer¹

Affiliation

¹ School of Medicine, University of Leeds, Leeds, LS2 9JT, UK. m.s.homer@leeds.ac.uk.

PMID: 37843678
DOI: 10.1007/s10459-023-10289-w

Abstract

Systematic differences in OSCE scoring across examiners (often termed examiner stringency) can threaten the validity of examination outcomes. Such effects are usually conceptualised and operationalised based solely on checklist/domain scores in a station, and global grades are rarely used in this type of analysis. In this work, a large candidate-level exam dataset is analysed to develop a more sophisticated understanding of examiner stringency. Station scores are modelled based on global grades, with each candidate, station and examiner allowed to vary in their ability, difficulty and stringency respectively. In addition, examiners are allowed to vary in how they discriminate across grades; to our knowledge, this is the first time this has been investigated. Results show that examiners contribute strongly to variance in scoring in two distinct ways: via the traditional conception of score stringency (34% of score variance), but also in how they discriminate in scoring across grades (7%). As one might expect, candidate and station account for only a small amount of score variance at the station level once candidate grades are accounted for (3% and 2% respectively), with the remainder being residual (54%). Investigation of impacts on station-level candidate pass/fail decisions suggests that examiner differential stringency effects combine to give false positive (candidates passing in error) and false negative (failing in error) rates in stations of around 5% each, but at the exam level these reduce to 0.4% and 3.3% respectively. This work adds to our understanding of examiner behaviour by demonstrating that examiners can vary in qualitatively different ways in their judgments. For institutions, it emphasises the key message that it is important to sample widely from the examiner pool via sufficient stations to ensure OSCE-level decisions are sufficiently defensible. It also suggests that examiner training should include discussion of global grading, and the combined effect of scoring and grading on candidate outcomes.
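
The modelling described in the abstract can be read as a linear mixed model with crossed random effects. The abstract does not give the specification, so the following is only a sketch: the symbols, the linear treatment of grade, the normality assumptions and the independence of the random effects are all assumptions made for illustration. Here $y_{cse}$ is the station score for candidate $c$ in station $s$ marked by examiner $e$, and $g_{cse}$ is the corresponding global grade.

```latex
% Illustrative sketch only; all notation is assumed, not taken from the paper.
\begin{align*}
y_{cse} &= \beta_0 + \beta_1\, g_{cse}
          + u_c + v_s + w_{0e} + w_{1e}\, g_{cse} + \varepsilon_{cse}, \\
u_c &\sim N(0, \sigma^2_{\text{candidate}}), \qquad
v_s \sim N(0, \sigma^2_{\text{station}}), \\
w_{0e} &\sim N(0, \sigma^2_{\text{examiner}}), \qquad
w_{1e} \sim N(0, \sigma^2_{\text{discrimination}}), \qquad
\varepsilon_{cse} \sim N(0, \sigma^2_{\text{residual}}).
\end{align*}
```

Under this reading, $w_{0e}$ corresponds to the traditional notion of examiner score stringency and $w_{1e}$ to examiner differences in how scores discriminate across grades. The reported shares of score variance (34% examiner stringency, 7% examiner discrimination, 3% candidate, 2% station, 54% residual) sum to 100%.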

Keywords: Borderline regression; Examiner stringency; OSCE; Standard setting.