Introducing the BlendedICU dataset, the first harmonized, international intensive care dataset

Matthieu Oliver; Jérôme Allyn; Rémi Carencotte; Nicolas Allou; Cyril Ferdynus

doi:10.1016/j.jbi.2023.104502

J Biomed Inform. 2023 Oct:146:104502. doi: 10.1016/j.jbi.2023.104502. Epub 2023 Sep 27.

Authors

Matthieu Oliver¹, Jérôme Allyn², Rémi Carencotte³, Nicolas Allou², Cyril Ferdynus⁴

Affiliations

¹ Methodological Support Unit, Reunion University Hospital, La Réunion, France; Clinical Informatics Department, Reunion University Hospital, La Réunion, France. Electronic address: matthieu.oliver@chu-reunion.fr.
² Methodological Support Unit, Reunion University Hospital, La Réunion, France; Intensive Care Unit, Reunion University Hospital, La Réunion, France; Clinical Informatics Department, Reunion University Hospital, La Réunion, France.
³ Methodological Support Unit, Reunion University Hospital, La Réunion, France; Clinical Informatics Department, Reunion University Hospital, La Réunion, France.
⁴ Methodological Support Unit, Reunion University Hospital, La Réunion, France; Clinical Informatics Department, Reunion University Hospital, La Réunion, France; Clinical Research Department, INSERM CIC1410, F-97410, La Réunion, France.

PMID: 37769828
DOI: 10.1016/j.jbi.2023.104502

Abstract

Objective: This study introduces the BlendedICU dataset, a massive dataset of international intensive care data. This dataset aims to facilitate generalizability studies of machine learning models, as well as statistical studies of clinical practices in intensive care units.

Methods: Four publicly available and patient-level intensive care databases were used as source databases. A unique and customizable preprocessing pipeline extracted clinically relevant patient-related variables from each source database. The variables were then harmonized and standardized to the Observational Medical Outcomes Partnership (OMOP) Common Data Format. Finally, a brief comparison was carried out to explore differences in the source databases.
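
The harmonization step described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the source labels (`source_a`, `source_b`), the raw column names, the harmonized variable names, and the unit conversion are all hypothetical, chosen only to show how per-source mappings can bring differently named and differently scaled variables into one shared schema before the records are blended.

```python
from typing import Callable, Dict, List, Tuple

Converter = Callable[[float], float]


def identity(value: float) -> float:
    """Return the value unchanged (source already uses the target unit)."""
    return value


def fahrenheit_to_celsius(value: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


# Hypothetical mappings: raw column name -> (harmonized name, unit converter).
# None of these names come from the BlendedICU code base.
MAPPINGS: Dict[str, Dict[str, Tuple[str, Converter]]] = {
    "source_a": {
        "hr": ("heart_rate", identity),
        "temp_c": ("temperature", identity),
    },
    "source_b": {
        "HeartRate": ("heart_rate", identity),
        "temp_f": ("temperature", fahrenheit_to_celsius),
    },
}


def harmonize(source: str, rows: List[Dict[str, float]]) -> List[dict]:
    """Rename and convert the mapped columns of one source; drop the rest."""
    mapping = MAPPINGS[source]
    harmonized = []
    for row in rows:
        clean: dict = {"source": source}  # keep provenance for later comparison
        for raw_name, value in row.items():
            if raw_name in mapping:
                name, convert = mapping[raw_name]
                clean[name] = convert(value)
        harmonized.append(clean)
    return harmonized


def blend(sources: Dict[str, List[Dict[str, float]]]) -> List[dict]:
    """Harmonize every source and concatenate the results into one list."""
    blended: List[dict] = []
    for source, rows in sources.items():
        blended.extend(harmonize(source, rows))
    return blended


if __name__ == "__main__":
    example = {
        "source_a": [{"hr": 80.0, "temp_c": 37.0}],
        "source_b": [{"HeartRate": 90.0, "temp_f": 98.6, "unused": 1.0}],
    }
    for record in blend(example):
        print(record)
```

Keeping a `source` field on every record is a deliberate choice in this sketch: it preserves the provenance needed to compare data collection and outcomes between the source databases after blending.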

Results: The BlendedICU dataset features 41 timeseries variables as well as the exposure times to 113 active ingredients extracted from the AmsterdamUMCdb, eICU, HiRID, and MIMIC-IV databases. This resulted in a database of more than 309,000 intensive care admissions, spanning over 13 years and three countries. We found that data collection, drug exposure, and patient outcomes varied strongly between source databases.

Conclusion: The variability in data collection, drug exposure, and patient outcomes between the source databases indicated some dissimilarity in patient phenotypes and clinical practices between different intensive care units. This demonstrated the need for generalizability studies of machine learning models. This study provides the clinical data research community with essential data to build efficient and generalizable machine learning models, as well as to explore clinical practices in intensive care units around the world.

Keywords: Data integration; Intensive care unit database; OMOP common data format.