Development of an algorithm for finding pertussis episodes in a population-based electronic health record database

Chathuri Daluwatte; Maryia Dvaretskaya; Sam Ekhtiari; Paul Hayat; Martin Montmerle; Sachin Mathur; Denis Macina

doi:10.1080/21645515.2023.2209455

Hum Vaccin Immunother. 2023 Dec 31;19(1):2209455. doi: 10.1080/21645515.2023.2209455.

Authors

Chathuri Daluwatte¹, Maryia Dvaretskaya², Sam Ekhtiari², Paul Hayat², Martin Montmerle², Sachin Mathur³, Denis Macina⁴

Affiliations

¹ Digital Data, Sanofi US Services, Inc, Cambridge, MA, USA.
² Healthcare, Quinten SAS, Paris, France.
³ Digital R&D, Sanofi US Services, Inc, Cambridge, MA, USA.
⁴ Global Medical, PPH Franchise, Sanofi, Lyon, France.

Abstract

While tetanus-diphtheria-acellular pertussis (Tdap) vaccines for adolescents and adults were licensed in 2005 and immunization strategies proposed, the burden of pertussis in this population remains under-recognized mainly due to atypical disease presentation, undermining efforts to optimize protection through vaccination. We developed a machine learning algorithm to identify undiagnosed/misdiagnosed pertussis episodes in patients diagnosed with acute respiratory disease (ARD) using signs, diseases and symptoms from clinician notes and demographic information within electronic health-care records (Optum Humedica repository [2007-2019]). We used two patient cohorts aged ≥11 years to develop the model: a positive pertussis cohort (4,515 episodes in 4,316 patients) and a negative pertussis (ARD) cohort (4,573,445 episodes and patients), defined using ICD 9/10 codes. To improve contrast between positive pertussis and negative pertussis (ARD) episodes, only episodes with ≥7 symptoms were selected. LightGBM was used as the machine learning model for pertussis episode identification. Model validity was determined using laboratory-confirmed pertussis positive and negative cohorts. Model explainability was obtained using the Shapley additive explanations method. The predictive performance was as follows: area under the precision-recall curve, 0.24 (SD, 7 × 10^-3); recall, 0.72 (SD, 4 × 10^-3); precision, 0.012 (SD, 1 × 10^-3); and specificity, 0.94 (SD, 7 × 10^-3). The model applied to laboratory-confirmed positive and negative pertussis episodes had a specificity of 0.846. Predictive probability for pertussis increased with presence of whooping cough, whoop, and post-tussive vomiting in clinician notes, but decreased with gastrointestinal bleeding, sepsis, pulmonary symptoms, and fever. In conclusion, machine learning can help identify pertussis episodes among those diagnosed with ARD.
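The very low precision reported alongside good recall and specificity follows from the extreme class imbalance between the two cohorts. The sketch below is an illustration only, not the authors' evaluation code: it assumes the cohort sizes quoted in the abstract (4,515 positive and 4,573,445 negative episodes) as a stand-in for the evaluated data, since the abstract does not give the counts remaining after the ≥7-symptom filter. The function name `precision_from_rates` is hypothetical.

```python
def precision_from_rates(recall, specificity, n_pos, n_neg):
    """Precision implied by recall, specificity, and class counts.

    TP = recall * n_pos
    FP = (1 - specificity) * n_neg
    precision = TP / (TP + FP)
    """
    true_pos = recall * n_pos
    false_pos = (1.0 - specificity) * n_neg
    return true_pos / (true_pos + false_pos)


# Assumption: unfiltered cohort sizes from the abstract, used only to
# approximate the class balance the model was evaluated on.
n_pos = 4_515        # positive pertussis episodes
n_neg = 4_573_445    # negative pertussis (ARD) episodes

# Mean recall and specificity as reported in the abstract.
precision = precision_from_rates(0.72, 0.94, n_pos, n_neg)

print(f"prevalence        = {n_pos / (n_pos + n_neg):.5f}")
print(f"implied precision = {precision:.4f}")
```

Under these assumptions the implied precision lands close to the reported 0.012: with roughly one positive episode per thousand, even a 6% false-positive rate among ARD episodes produces far more false alarms than true detections.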

Keywords: Algorithms; Bordetella pertussis; diagnosis; electronic health record; machine learning; predictive modeling.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adolescent
Adult
Diphtheria* / prevention & control
Diphtheria-Tetanus-acellular Pertussis Vaccines*
Electronic Health Records
Humans
Tetanus* / prevention & control
Vaccination
Whooping Cough* / diagnosis
Whooping Cough* / epidemiology
Whooping Cough* / prevention & control

Substances

Diphtheria-Tetanus-acellular Pertussis Vaccines

Grants and funding

This study was funded and sponsored by Sanofi. Sanofi was involved in the study design; in accessing the electronic health-care records database; in the analysis and interpretation of data; in the writing of the report; and in the decision to submit the paper for publication. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.