Using a gradient boosted model for case ascertainment from free-text veterinary records

Uttara Kennedy; Mandy Paterson; Nicholas Clark

doi:10.1016/j.prevetmed.2023.105850

Prev Vet Med. 2023 Mar:212:105850. doi: 10.1016/j.prevetmed.2023.105850. Epub 2023 Jan 10.

Authors

Uttara Kennedy¹, Mandy Paterson², Nicholas Clark³

Affiliations

¹ UQ School of Veterinary Science, The University of Queensland, Gatton, Queensland 4343, Australia; RSPCA Queensland, Animal Care Campus, 139 Wacol Station Road, Wacol, Queensland 4076, Australia. Electronic address: uttara.kennedy@uq.edu.au.
² UQ School of Veterinary Science, The University of Queensland, Gatton, Queensland 4343, Australia; RSPCA Queensland, Animal Care Campus, 139 Wacol Station Road, Wacol, Queensland 4076, Australia.
³ UQ School of Veterinary Science, The University of Queensland, Gatton, Queensland 4343, Australia.

PMID: 36638610
DOI: 10.1016/j.prevetmed.2023.105850

Abstract

Case ascertainment for prevalence and incidence studies from veterinary clinical data poses a major challenge because medical notes are not consistently structured or complete. Using natural language processing (NLP) and machine learning, this study aimed to obtain accurate case recognition for feline upper respiratory tract infections (primarily caused by viruses such as feline herpes virus (FHV-1) and feline calici virus (FCV), and bacteria such as Chlamydophila felis, Mycoplasma felis and Bordetella bronchiseptica) using retrospective electronic veterinary records from the Royal Society for the Prevention of Cruelty to Animals, Queensland (RSPCA Qld). Data cleaning and NLP were carried out on eight years of free-text veterinary records from RSPCA Queensland to derive text-based predictors. The NLP steps included sorting records by length of stay, vectorising, tokenising and spell checking against a bespoke veterinary database. A gradient boosted model (GBM) was trained to predict the probability of each animal having a diagnosis of upper respiratory infection. A manually annotated dataset was used for training the algorithm to learn dominant patterns between predictors (frequencies of n-grams) and responses (manual binary case classification). The GBM's performance was tested against an out-of-sample validation dataset, and model-agnostic methods were used to interrogate the model's learning process. The GBM used patient-level frequencies of 1250 unique n-grams as predictor variables and was able to predict the probability of cases in the validation dataset with an accuracy of 0.95 (95% CI 0.92, 0.97) and F1 score of 0.96. Predictors that exerted the highest influence on the model included frequencies of "doxycycline", "flu", "sneezing", "doxybrom" and "ocular". The trained GBM was deployed on the full dataset spanning eight years, comprising 60,258 clinical entries.
The prevalence in the full dataset was predicted to be 23.59%, which is in line with domain expertise from practicing veterinarians at the shelter. Case ascertainment is a crucial step for further epidemiological study of cat flu. Ultimately, this tool can be extended to other clinical procedures, conditions, and diseases such as intensive care treatment due to snake bites and tick paralysis, physical injuries such as orthopaedic fractures or chest injuries and labour-intensive infectious diseases like parvovirus, canine cough, and ringworm, all of which require prolonged quarantine and care.
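
Two quantities named in the abstract, patient-level n-gram frequencies and the accuracy and F1 score on a validation set, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the tokenisation rule (lower-cased alphabetic tokens), the choice of unigrams and bigrams, and the example notes and labels are all assumptions made for illustration. Spell checking and the GBM itself are not reproduced here.

```python
import re
from collections import Counter


def tokenise(text):
    """Lower-case a free-text note and split it into alphabetic tokens.

    Assumed rule for illustration; the abstract does not specify one.
    """
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(notes, n_max=2):
    """Count every 1..n_max-gram across all notes belonging to one patient."""
    counts = Counter()
    for note in notes:
        tokens = tokenise(note)
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = case, 0 = non-case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical records: patient identifier -> list of free-text notes.
patient_notes = {
    "cat_001": [
        "Sneezing and ocular discharge. Started doxycycline.",
        "Still sneezing, continue doxycycline.",
    ],
    "cat_002": ["Desexed today, recovering well."],
}

# One row of predictors per patient: n-gram -> frequency.
features = {pid: ngram_counts(notes) for pid, notes in patient_notes.items()}

# Hypothetical manual labels and model predictions for five animals.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(features["cat_001"]["sneezing"], accuracy, f1)
```

In a setting like the one described, each patient's counts would form one row of a predictor matrix restricted to a fixed vocabulary of n-grams, and the metrics would be computed on animals held out from training.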

Keywords: Bordetella bronchiseptica; Calici virus; Case ascertainment; Chlamydophila felis; Feline; Gradient boosted model; Herpes virus; Machine learning; Mycoplasma felis; Shelter.

MeSH terms

Animals
Calicivirus, Feline*
Cat Diseases* / epidemiology
Cats
Dog Diseases*
Dogs
Queensland / epidemiology
Respiratory Tract Infections* / epidemiology
Respiratory Tract Infections* / veterinary
Retrospective Studies