Generalizability of an acute kidney injury prediction model across health systems

Jie Cao; Xiaosong Zhang; Vahakn Shahinian; Huiying Yin; Diane Steffick; Rajiv Saran; Susan Crowley; Michael Mathis; Girish N Nadkarni; Michael Heung; Karandeep Singh

doi:10.1038/s42256-022-00563-8

Nat Mach Intell. 2022 Dec;4(12):1121-1129. doi: 10.1038/s42256-022-00563-8. Epub 2022 Dec 1.

Authors

Jie Cao¹, Xiaosong Zhang², Vahakn Shahinian²,³, Huiying Yin², Diane Steffick², Rajiv Saran²,³,⁴, Susan Crowley⁵, Michael Mathis⁶, Girish N Nadkarni⁷,⁸, Michael Heung²,³, Karandeep Singh³,⁹,¹⁰

Affiliations

¹ Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI.
² Kidney Epidemiology and Cost Center, School of Public Health, University of Michigan, Ann Arbor, MI.
³ Division of Nephrology, Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI.
⁴ Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI.
⁵ Renal Section, VA Connecticut Healthcare System, West Haven, CT.
⁶ Department of Anesthesiology, University of Michigan Medical School, Ann Arbor, MI.
⁷ Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, NY.
⁸ Division of Data Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY.
⁹ Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI.
¹⁰ School of Information, University of Michigan, Ann Arbor, MI.

Abstract

Delays in the identification of acute kidney injury (AKI) in hospitalized patients are a major barrier to the development of effective interventions to treat AKI. A recent study by Tomasev and colleagues at DeepMind described a model that achieved state-of-the-art performance in predicting AKI up to 48 hours in advance.¹ Because this model was trained in a population of US Veterans that was 94% male, questions have arisen about its reproducibility and generalizability. In this study, we aimed to reproduce key aspects of this model, trained and evaluated it in a similar population of US Veterans, and evaluated its generalizability in a large academic hospital setting. We found that the model performed worse in predicting AKI in females in both populations, with miscalibration in lower stages of AKI and worse discrimination (a lower area under the curve) in higher stages of AKI. We demonstrate that while this discrepancy in performance can be largely corrected in non-Veterans by updating the original model using data from a sex-balanced academic hospital cohort, the worse model performance persists in Veterans. Our study sheds light on the importance of reproducing artificial intelligence studies, and on the complexity of discrepancies in model performance in subgroups that cannot be explained simply on the basis of sample size.
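
The abstract compares discrimination (area under the curve) and calibration between sexes. As an illustration only, and not the authors' evaluation code, the sketch below shows one simple way such subgroup metrics could be computed. The function names, the "M"/"F" group labels, and the toy risk scores are all hypothetical, and the calibration measure used here (mean predicted risk minus observed event rate) is a deliberately crude stand-in for the calibration analysis a full study would report.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Chance that a random positive case outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_gap(labels, scores):
    """Mean predicted risk minus observed event rate; 0 means calibrated in the large."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)


def metrics_by_group(groups, labels, scores):
    """Compute both metrics separately for each subgroup value."""
    buckets = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {
        g: {
            "n": len(ys),
            "auroc": auroc(ys, ss),
            "calibration_gap": calibration_gap(ys, ss),
        }
        for g, (ys, ss) in buckets.items()
    }


# Hypothetical toy data: 1 = AKI occurred, scores = predicted risk.
groups = ["M", "M", "M", "M", "F", "F", "F", "F"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.1, 0.6, 0.7, 0.8, 0.5]

result = metrics_by_group(groups, labels, scores)
for group, m in result.items():
    print(group, m)
```

In this toy data the "F" subgroup has both a lower AUROC and a larger calibration gap than the "M" subgroup, which mirrors the kind of discrepancy the abstract describes without reproducing any of its actual results.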

Grants and funding

R01 DK133226/DK/NIDDK NIH HHS/United States