Feature engineering from medical notes: A case study of dementia detection

Zina Ben Miled; Paul R Dexter; Randall W Grout; Malaz Boustani

doi:10.1016/j.heliyon.2023.e14636

Heliyon. 2023 Mar 18;9(3):e14636. doi: 10.1016/j.heliyon.2023.e14636. eCollection 2023 Mar.

Authors

Zina Ben Miled¹,², Paul R Dexter³,², Randall W Grout³, Malaz Boustani³,²

Affiliations

¹ Department of Electrical and Computer Engineering, School of Engineering and Technology, Indiana University Purdue University at Indianapolis, 723 W. Michigan Street, Indianapolis, IN, 46202, USA.
² Regenstrief Institute, Inc., 1101 W. 10th Street, Indianapolis, IN, 46202, USA.
³ Indiana University School of Medicine, 340 W 10th St, Indianapolis, IN, 46202, USA.

Abstract

Background and objectives: Medical notes are narratives that describe the health of the patient in free-text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk of developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions.

Methods: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select the keywords that best discriminate between patients at risk for dementia and healthy controls. The medical notes are then summarized as occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding is the average of the vector embeddings of the keyword occurrences, as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to Unified Medical Language System (UMLS) concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding, and a gradient boosted trees model is applied to the second encoding. All classifiers are developed using patients from a single health care institution and are then evaluated on held-out patients from the same institution as well as on test patients from two other health care institutions.
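The keyword selection step can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the version below assumes one common reading of TF-ICF (term frequency multiplied by a smoothed inverse class frequency). The function names `tf_icf` and `top_keywords`, the smoothing, and the toy tokens are all hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def tf_icf(notes_by_class):
    """Score every (class, term) pair with a TF-ICF-style weight.

    notes_by_class maps a class label to a list of tokenized notes.
    Assumed formulation (an illustration, not the paper's definition):
        tf(t, c) = count of t in class c / total tokens in class c
        icf(t)   = log(1 + n_classes / n_classes_containing_t)
    """
    # Pool all notes of a class into one bag of tokens.
    counts = {
        label: Counter(tok for note in notes for tok in note)
        for label, notes in notes_by_class.items()
    }
    n_classes = len(counts)

    # Number of classes in which each term appears at least once.
    class_freq = Counter()
    for bag in counts.values():
        class_freq.update(bag.keys())

    scores = {}
    for label, bag in counts.items():
        total = sum(bag.values())
        scores[label] = {
            term: (n / total) * math.log(1 + n_classes / class_freq[term])
            for term, n in bag.items()
        }
    return scores


def top_keywords(scores, label, k):
    """Return the k highest-scoring terms for a class (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy notes, already tokenized; not real patient data.
toy_notes = {
    "case": [["memory", "loss", "confusion"], ["memory", "fall"]],
    "control": [["fall", "cough"], ["cough", "fever"]],
}
scores = tf_icf(toy_notes)
print(top_keywords(scores, "case", 2))
```

In this toy example, a term that appears in both classes ("fall") receives a lower weight than an equally frequent term that appears in only one class ("loss"), which is the discriminative behavior the selection step relies on.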

Results: The results indicate that medical notes can be used to identify patients at risk for dementia one year ahead of disease onset, with an AUC of 75%, when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space or when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included.
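As a reading aid for the AUC figure, the sketch below computes AUC as the fraction of case-control pairs in which the case receives the higher risk score (ties count as half). The function name `pairwise_auc` and the toy labels and scores are hypothetical; they are chosen only to illustrate the metric and are unrelated to the study's data.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a random case outranks a random control."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")

    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical toy example: two cases (label 1) and two controls (label 0).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 case-control pairs are ordered correctly
```

Under this interpretation, an AUC of 75% means that the classifier ranks a randomly chosen at-risk patient above a randomly chosen control in three out of four pairs, while 50% corresponds to chance.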

Conclusion: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes, the availability of sufficient training data for each disease condition, and the variability resulting from different feature engineering techniques.

Keywords: BERT; Clinical BERT; Dementia; EMR; Medical notes; UMLS.

Grants and funding

R01 AG069765/AG/NIA NIH HHS/United States