Reliability estimates: behavioural stations and questionnaires in medical school admissions

Naomi Gafni; Avital Moshinsky; Orit Eisenberg; David Zeigler; Amitai Ziv

doi:10.1111/j.1365-2923.2011.04155.x

Med Educ. 2012 Mar;46(3):277-88. doi: 10.1111/j.1365-2923.2011.04155.x.

Authors

Naomi Gafni¹, Avital Moshinsky, Orit Eisenberg, David Zeigler, Amitai Ziv

Affiliation

¹ National Institute for Testing and Evaluation (NITE), Jerusalem, Israel. naomi@nite.org.il

PMID: 22324527

Abstract

Context: Assessment centres used to evaluate the non-cognitive attributes of medical school candidates must generate scores that measure these attributes as accurately as possible. Thus far, reliability coefficients for such centres have been based on limited samples and individual administrations, without reference to the error variance that may result from retesting or from the existence of multiple centres designed to measure the same attributes.

Methods: The National Institute for Testing and Evaluation in Israel has developed and administered two assessment centres: MOR is used by two medical schools and one dental school, and MIRKAM by another medical school. Each centre comprises eight or nine behavioural stations, a standardised biographical questionnaire, and a judgement and decision-making questionnaire. We calculated generalisability coefficients for each centre's eight or nine stations by year, composite reliability coefficients for the overall assessment centres, test-retest correlation coefficients for repeaters, and a correlation coefficient between the centres.
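For readers unfamiliar with generalisability coefficients, the following is a minimal sketch of how a relative G coefficient could be computed for a fully crossed examinees-by-stations design. The abstract does not specify the design or estimation method actually used, so the single-facet design, the function name `g_coefficient`, and the ANOVA-based variance estimates are all illustrative assumptions.

```python
import statistics


def g_coefficient(scores, n_stations=None):
    """Relative G coefficient for a crossed persons x stations design.

    Hypothetical helper, not the authors' procedure.
    scores: list of rows, one row per examinee, one column per station.
    n_stations: number of stations to project to (a D-study);
                defaults to the number of stations observed.
    """
    n_p = len(scores)
    n_s = len(scores[0])
    if n_p < 2 or n_s < 2:
        raise ValueError("need at least two examinees and two stations")
    if any(len(row) != n_s for row in scores):
        raise ValueError("every examinee needs a score on every station")

    grand = statistics.fmean([x for row in scores for x in row])
    person_means = [statistics.fmean(row) for row in scores]
    station_means = [statistics.fmean(col) for col in zip(*scores)]

    # Two-way ANOVA without replication: partition the sums of squares.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_person = n_s * sum((m - grand) ** 2 for m in person_means)
    ss_station = n_p * sum((m - grand) ** 2 for m in station_means)
    ss_resid = ss_total - ss_person - ss_station

    ms_person = ss_person / (n_p - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_s - 1))

    # Variance components; negative estimates are truncated at zero.
    var_person = max((ms_person - ms_resid) / n_s, 0.0)
    var_resid = max(ms_resid, 0.0)

    k = n_s if n_stations is None else n_stations
    denom = var_person + var_resid / k
    return var_person / denom if denom > 0 else float("nan")
```

With the default `n_stations`, this quantity coincides with Cronbach's alpha computed over the stations. Passing a larger `n_stations` projects the coefficient to a longer assessment under the assumption that added stations behave like the existing ones.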

Results: Between 2006 and 2009, 2662 and 2023 examinees participated in MOR and MIRKAM, respectively; 1479 of these participated in both. The average generalisability coefficients for the stations were 0.69 for MOR and 0.67 for MIRKAM. The composite reliability coefficients for the full centres (behavioural stations plus questionnaires) were 0.79 and 0.76 for MOR and MIRKAM, respectively. The correlations for repeaters, corrected for restriction of range, were 0.59 and 0.43 for MOR and MIRKAM stations, respectively, and 0.72 and 0.65 for the full MOR and MIRKAM assessments, respectively. The correlation between scores on the MOR and MIRKAM stations was 0.56 (0.75 for the overall score).
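The repeater correlations above are reported as corrected for restriction of range. The abstract does not say which correction was applied; a common choice for direct selection on the measured variable is Thorndike's Case II formula, sketched below as an assumption. The function name and the standard deviations passed to it are hypothetical, since the abstract reports neither the restricted nor the unrestricted standard deviation.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction (assumed here, not confirmed by the source).

    r_restricted:    correlation observed in the selected (restricted) group
    sd_unrestricted: standard deviation in the full applicant pool
    sd_restricted:   standard deviation in the selected group
    """
    if not -1.0 <= r_restricted <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")

    u = sd_unrestricted / sd_restricted  # ratio > 1 when range is restricted
    r = r_restricted
    return (r * u) / math.sqrt(1.0 - r * r + r * r * u * u)
```

When the two standard deviations are equal the correlation is returned unchanged, and the corrected value grows as the selected group becomes more homogeneous relative to the applicant pool.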

Discussion: The minimal reliability desirable for high-stakes decision making (0.80) was obtained only for 14 or 15 stations combined with the questionnaires. Nevertheless, the values obtained are considerably higher than reliability coefficients for single interviews. The questionnaires contribute significantly to the accuracy of the measurement. These reliability measures constitute an upper bound on measures of validity.
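Two textbook relations from classical test theory help in reading these statements: the Spearman-Brown formula, which projects reliability when a test is lengthened, and the bound that a validity coefficient cannot exceed the square root of the measure's reliability. The sketch below illustrates both. It is not the authors' calculation: their figure of 14 or 15 stations refers to stations combined with questionnaires, whereas this sketch treats a single homogeneous set of stations, and all function names are hypothetical.

```python
import math


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


def lengthening_factor(reliability, target):
    """Factor by which a test must be lengthened to reach the target reliability."""
    if not (0.0 < reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliability and target must lie strictly between 0 and 1")
    return (target * (1.0 - reliability)) / (reliability * (1.0 - target))


def validity_ceiling(reliability):
    """Upper bound on a validity coefficient: the square root of reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return math.sqrt(reliability)
```

For example, `lengthening_factor(0.69, 0.80)` gives a factor of roughly 1.8 for the MOR stations taken alone, which for an eight- or nine-station circuit is of the same order as the station counts mentioned in the discussion, although it does not account for the questionnaires.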

MeSH terms

Analysis of Variance
Behavior
Decision Making
Education, Medical, Undergraduate / standards*
Educational Measurement / methods*
Humans
Interviews as Topic
Israel
Judgment
Psychometrics
Reproducibility of Results
School Admission Criteria*
Schools, Medical
Students, Medical / psychology*
Surveys and Questionnaires