An increasing number of convolutional neural networks for fracture recognition and classification in orthopaedics : are these externally validated and ready for clinical application?

Luisa Oliveira E Carmo; Anke van den Merkhof; Jakub Olczak; Max Gordon; Paul C Jutte; Ruurd L Jaarsma; Frank F A IJpma; Job N Doornberg; Jasper Prijs; Machine Learning Consortium

doi:10.1302/2633-1462.210.BJO-2021-0133

Bone Jt Open. 2021 Oct;2(10):879-885. doi: 10.1302/2633-1462.210.BJO-2021-0133.

Authors

Luisa Oliveira E Carmo¹, Anke van den Merkhof²,³, Jakub Olczak⁴, Max Gordon⁴, Paul C Jutte¹, Ruurd L Jaarsma²,³, Frank F A IJpma⁵, Job N Doornberg¹,²,³,⁵, Jasper Prijs¹,²,³,⁵; Machine Learning Consortium⁶

Collaborators

Machine Learning Consortium:
Paul Algra⁶, Michel van den Bekerom⁶, Mohit Bhandari⁶, Michiel Bongers⁶, Charles Court-Brown⁶, Anne-Eva Bulstra⁶, Geert Buijze⁶, Sofia Bzovsky⁶, Joost Colaris⁶, Neil Chen⁶, Job Doornberg⁶, Andrew Duckworth⁶, J. Carel Goslings⁶, Max Gordon⁶, Benjamin Gravesteijn⁶, Olivier Groot⁶, Gordon Guyatt⁶, Laurent Hendrickx⁶, Beat Hintermann⁶, DirkJan Hofstee⁶, Frank IJpma⁶, Ruurd Jaarsma⁶, Stein Janssen⁶, Kyle Jeray⁶, Paul Jutte⁶, Aditya Karhade⁶, Lucien Keijser⁶, Gino Kerkhoffs⁶, David Langerhuizen⁶, Jonathan Lans⁶, Wouter Mallee⁶, Matthew Moran⁶, Margaret McQueen⁶, Marjolein Mulders⁶, Rob Nelissen⁶, Miryam Obdeijn⁶, Tarandeep Oberai⁶, Jakub Olczak⁶, Jacobien HF Oosterhoff⁶, Brad Petrisor⁶, Rudolf Poolman⁶, Jasper Prijs⁶, David Ring⁶, Paul Tornetta III⁶, David Sanders⁶, Joseph Schwab⁶, Emil H Schemitsch⁶, Niels Schep⁶, Inger Schipper⁶, Bram Schoolmeesters⁶, Joseph Schwab⁶, Marc Swiontkowski⁶, Sheila Sprague⁶, Ewout Steyerberg⁶, Vincent Stirler⁶, Paul Tornetta⁶, Stephen D Walter⁶, Monique Walenkamp⁶, Mathieu Wijffels⁶

Affiliations

¹ Department of Orthopaedic Surgery, University Medical Centre, University of Groningen, Groningen, Groningen, Netherlands.
² Department of Orthopaedic Surgery, Flinders Medical Centre, Bedford Park, Adelaide, South Australia, Australia.
³ Flinders University, Bedford Park, Adelaide, South Australia, Australia.
⁴ Institute of Clinical Sciences, Danderyd University Hospital, Karolinska Institute, Stockholm, Sweden.
⁵ Department of Trauma Surgery, University Medical Centre Groningen, University of Groningen, Groningen, Groningen, Netherlands.
⁶ Machine Learning Consortium

PMID: 34669518
PMCID: PMC8558452
DOI: 10.1302/2633-1462.210.BJO-2021-0133

Abstract

Aims: The number of convolutional neural networks (CNNs) available for fracture detection and classification is rapidly increasing. External validation of a CNN on a temporally separate (separated by time) or geographically separate (separated by location) dataset is crucial to assess the generalizability of the CNN before it is applied in clinical practice at other institutions. We aimed to answer the following questions: are current CNNs for fracture recognition externally valid; which methods are applied for external validation (EV); and how do the reported performances on the EV sets compare with those on the internal validation (IV) sets of these CNNs?
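To make the distinction between temporal and geographical EV concrete, the following is a minimal sketch of how a pool of images could be partitioned. It is an illustration only: the `Radiograph` record, its field names, the hospital names, and the cutoff date are all hypothetical and are not taken from any of the reviewed studies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Radiograph:
    """Hypothetical image record; field names are illustrative only."""
    image_id: str
    site: str        # institution where the image was acquired
    acquired: date   # acquisition date


def split_for_validation(records, dev_site, cutoff):
    """Partition records into development, temporal EV, and geographical EV sets.

    development     : images from dev_site acquired before cutoff
                      (training and internal validation both come from this pool)
    temporal EV     : images from dev_site acquired on or after cutoff
    geographical EV : images from any other site, regardless of date
    """
    development, temporal_ev, geographical_ev = [], [], []
    for record in records:
        if record.site != dev_site:
            geographical_ev.append(record)
        elif record.acquired >= cutoff:
            temporal_ev.append(record)
        else:
            development.append(record)
    return development, temporal_ev, geographical_ev


# Toy data (assumed): two sites, images spread over several years.
records = [
    Radiograph("a1", "Hospital A", date(2017, 3, 1)),
    Radiograph("a2", "Hospital A", date(2018, 6, 15)),
    Radiograph("a3", "Hospital A", date(2019, 9, 30)),
    Radiograph("b1", "Hospital B", date(2018, 1, 10)),
    Radiograph("b2", "Hospital B", date(2020, 2, 20)),
]

development, temporal_ev, geographical_ev = split_for_validation(
    records, dev_site="Hospital A", cutoff=date(2019, 1, 1)
)
```

In this sketch the internal validation set would be held out from `development`, so it shares both institution and time period with the training data, whereas the two EV sets each differ from it in exactly one of those respects.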

Methods: The PubMed and Embase databases were systematically searched from January 2010 to October 2020 according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. The type of EV, characteristics of the external dataset, and diagnostic performance characteristics on the IV and EV datasets were collected and compared. Quality assessment was conducted using a seven-item checklist based on a modified Methodologic Index for NOn-Randomized Studies instrument (MINORS).

Results: Out of 1,349 studies, 36 reported development of a CNN for fracture detection and/or classification. Of these, only four (11%) reported a form of EV. One study used temporal EV, one conducted both temporal and geographical EV, and two used geographical EV. When comparing the CNN's performance on the IV set versus the EV set, the following were found: AUCs of 0.967 (IV) versus 0.975 (EV), 0.976 (IV) versus 0.985 to 0.992 (EV), 0.93 to 0.96 (IV) versus 0.80 to 0.89 (EV), and F1-scores of 0.856 to 0.863 (IV) versus 0.757 to 0.840 (EV).
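For readers less familiar with the two metrics reported above, the sketch below shows one way the area under the ROC curve (AUC) and the F1-score could be computed on an IV set and an EV set and then compared. The labels, scores, and the 0.5 decision threshold are made-up toy values chosen for illustration; they are not data from the reviewed studies, and the helper functions are assumptions rather than the methods those studies used.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def f1_score(labels, predictions):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def binarize(scores, threshold=0.5):
    """Turn predicted probabilities into 0/1 calls (threshold is an assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy data (assumed): 1 = fracture, 0 = no fracture.
iv_labels = [1, 1, 1, 0, 0, 0]
iv_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
ev_labels = [1, 1, 1, 0, 0, 0]
ev_scores = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]

iv_auc, ev_auc = auc(iv_labels, iv_scores), auc(ev_labels, ev_scores)
iv_f1 = f1_score(iv_labels, binarize(iv_scores))
ev_f1 = f1_score(ev_labels, binarize(ev_scores))

# A negative difference means performance dropped on the external set.
auc_change = ev_auc - iv_auc
f1_change = ev_f1 - iv_f1

# Share of included studies reporting any EV, using the counts in the Results.
share_with_ev = 4 / 36
```

A drop from IV to EV, as in the toy data here, is the pattern that would suggest limited generalizability; two of the four comparisons reported above instead show equal or higher AUC on the EV set.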

Conclusion: Externally validated CNNs for fracture recognition in orthopaedic trauma are still scarce. This greatly limits the potential for transferring these CNNs from the developing institute to another hospital with similar diagnostic performance. We recommend the use of geographical EV and statements such as the Consolidated Standards of Reporting Trials-Artificial Intelligence (CONSORT-AI), the Standard Protocol Items: Recommendations for Interventional Trials-Artificial Intelligence (SPIRIT-AI), and the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis-Machine Learning (TRIPOD-ML) to critically appraise the performance of CNNs, improve the methodological rigor and quality of future models, and facilitate eventual implementation in clinical practice. Cite this article: Bone Jt Open 2021;2(10):879-885.

Keywords: Artificial intelligence; CT scans; Convolutional neural networks; Deep learning; External validation; Machine learning; Prognosis; cadaveric studies; distal radius fractures; elbows; hip; orthopaedic surgeons; orthopaedic trauma; radiographs; variances.