Inter-rater reliability of risk of bias tools for non-randomized studies

Isabel Kalaycioglu; Bastien Rioux; Joel Neves Briard; Ahmad Nehme; Lahoud Touma; Bénédicte Dansereau; Ariane Veilleux-Carpentier; Mark R Keezer

doi:10.1186/s13643-023-02389-w

Syst Rev. 2023 Dec 7;12(1):227. doi: 10.1186/s13643-023-02389-w.

Authors

Isabel Kalaycioglu¹, Bastien Rioux¹ ² ³, Joel Neves Briard¹ ² ³, Ahmad Nehme¹ ² ³, Lahoud Touma¹ ² ³, Bénédicte Dansereau¹ ² ³, Ariane Veilleux-Carpentier¹ ² ³, Mark R Keezer⁴ ⁵ ⁶ ⁷

Affiliations

¹ Faculty of Medicine, Université de Montréal, Montreal, QC, Canada.
² Department of Neurosciences, Université de Montréal, Montreal, QC, Canada.
³ Centre Hospitalier de L'Université de Montréal, Pavillon R R04-700, 1000 Saint-Denis St., Montreal, QC, H2X 0C1, Canada.
⁴ Faculty of Medicine, Université de Montréal, Montreal, QC, Canada. mark.keezer@umontreal.ca.
⁵ Department of Neurosciences, Université de Montréal, Montreal, QC, Canada. mark.keezer@umontreal.ca.
⁶ Centre Hospitalier de L'Université de Montréal, Pavillon R R04-700, 1000 Saint-Denis St., Montreal, QC, H2X 0C1, Canada. mark.keezer@umontreal.ca.
⁷ School of Public Health, Université de Montréal, Montreal, QC, Canada. mark.keezer@umontreal.ca.

Abstract

Purpose: There is limited knowledge on the reliability of risk of bias (ROB) tools for assessing internal validity in systematic reviews of exposure and frequency studies. We aimed to identify and then compare the inter-rater reliability (IRR) of six commonly used tools for frequency (Loney scale, Gyorkos checklist, American Academy of Neurology [AAN] tool) and exposure (Newcastle-Ottawa scale, SIGN50 checklist, AAN tool) studies.

Methods: Six raters independently assessed the ROB of 30 frequency and 30 exposure studies using the three respective ROB tools. Articles were rated as low, intermediate, or high ROB. We calculated an intraclass correlation coefficient (ICC) for each tool and for each category of ROB tool. We compared the IRR between ROB tools and between tool types by inspecting whether their ICC 95% confidence intervals (CIs) overlapped and by comparing their coefficients after transformation to Fisher's Z values. We assessed the criterion validity of the AAN ROB tools by calculating an ICC for each rater in comparison with the original ratings from the AAN.
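
As an illustration of the calculations named in the Methods, the following is a minimal sketch and not the authors' analysis code. The abstract does not state which ICC form was used, so the sketch assumes a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)), with ratings coded numerically (for example low = 1, intermediate = 2, high = 3). The comparison step assumes the simple Fisher transformation z = atanh(ICC) with a standard error of 1/sqrt(n - 3) per coefficient, which treats the two ICCs as if they came from independent samples; that is an approximation. The function names icc_2_1 and compare_fisher_z are hypothetical.

```python
import math
from statistics import NormalDist


def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per study and one column per rater.

    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n = len(ratings)        # number of studies
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (studies, raters, residual).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def compare_fisher_z(icc_a, n_a, icc_b, n_b):
    """Two-sided z-test on the difference of two Fisher-transformed ICCs.

    Assumption: independent samples and SE = 1/sqrt(n - 3) per coefficient.
    """
    z_a, z_b = math.atanh(icc_a), math.atanh(icc_b)
    se = math.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
    z = (z_a - z_b) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p
```

With six raters and 30 studies per tool, icc_2_1 would take a 30 x 6 table of coded ratings, and compare_fisher_z would take the two resulting ICCs with n_a = n_b = 30. Because the three tools of each type were applied to the same 30 studies, the independence assumption is unlikely to hold exactly, so p-values from this sketch should be read as rough.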

Results: All individual ROB tools had an IRR in the substantial range or higher (ICC point estimates between 0.61 and 0.80). The IRR was almost perfect (ICC point estimate > 0.80) for the AAN frequency tool and the SIGN50 checklist. All tools were comparable in IRR, except for the AAN frequency tool, which had a significantly higher ICC than the Gyorkos checklist (p = 0.021) and trended towards a higher ICC when compared to the Loney scale (p = 0.085). When examined by category of ROB tool, scales and checklists had a substantial IRR, whereas the AAN tools had an almost perfect IRR. For the criterion validity of the AAN ROB tools, the average agreement between our raters and the original AAN ratings was moderate.
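
The agreement labels used in the Results can be expressed as a small lookup. This is a sketch only: the abstract gives the cut-offs for substantial (0.61 to 0.80) and almost perfect (> 0.80); the lower bands are an assumption, following the commonly used Landis and Koch convention, and the function name agreement_label is hypothetical.

```python
def agreement_label(icc):
    """Map an ICC point estimate to a qualitative agreement label.

    Cut-offs above 0.60 follow the abstract; the lower bands are assumed
    from the Landis and Koch convention.
    """
    if icc > 0.80:
        return "almost perfect"
    if icc > 0.60:
        return "substantial"
    if icc > 0.40:
        return "moderate"
    if icc > 0.20:
        return "fair"
    if icc >= 0.0:
        return "slight"
    return "poor"
```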

Conclusion: All tools had substantial IRRs except for the AAN frequency tool and the SIGN50 checklist, which both had an almost perfect IRR. The AAN ROB tools were the only category of ROB tools to demonstrate an almost perfect IRR. This category of ROB tools had fewer and simpler criteria. Overall, parsimonious tools with clear instructions, such as those from the AAN, may provide more reliable ROB assessments.

Keywords: Neurology; ROB assessments; Systematic reviews.

MeSH terms

Bias
Checklist*
Humans
Reproducibility of Results
Risk Assessment
Systematic Reviews as Topic