Hospital-wide natural language processing summarising the health data of 1 million patients

Daniel M Bean; Zeljko Kraljevic; Anthony Shek; James Teo; Richard J B Dobson

doi:10.1371/journal.pdig.0000218

PLOS Digit Health. 2023 May 9;2(5):e0000218. doi: 10.1371/journal.pdig.0000218. eCollection 2023 May.

Authors

Daniel M Bean^{1,2}, Zeljko Kraljevic^{1,3}, Anthony Shek^{1,4}, James Teo^{4,5}, Richard J B Dobson^{1,2,3,6,7}

Affiliations

¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom.
² Health Data Research UK London, University College London, London, United Kingdom.
³ NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, United Kingdom.
⁴ Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom.
⁵ Department of Neuroscience, King's College Hospital NHS Foundation Trust, London, United Kingdom.
⁶ Institute for Health Informatics, University College London, London, United Kingdom.
⁷ NIHR Biomedical Research Centre, University College London Hospitals NHS Foundation Trust, London, United Kingdom.

Abstract

Electronic health records (EHRs) represent a major repository of real-world clinical trajectories, interventions and outcomes. While modern enterprise EHRs try to capture data in structured, standardised formats, a significant bulk of the information captured in the EHR is still recorded only as unstructured text and can only be transformed into structured codes by manual processes. Recently, Natural Language Processing (NLP) algorithms have reached a level of performance suitable for large-scale, accurate information extraction from clinical text. Here we describe the application of open-source named-entity-recognition and linkage (NER+L) methods (CogStack, MedCAT) to the entire text content of a large UK hospital trust (King's College Hospital, London). The resulting dataset contains 157M SNOMED concepts generated from 9.5M documents for 1.07M patients over a period of 9 years. We present a summary of prevalence and disease onset as well as a patient embedding that captures major comorbidity patterns at scale. NLP has the potential to transform the health data lifecycle through large-scale automation of a traditionally manual task.

Copyright: © 2023 Bean et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Grants and funding

MR/S00310X/1/MRC_/Medical Research Council/United Kingdom