Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome

JAMA Netw Open. 2023 Mar 1;6(3):e231204. doi: 10.1001/jamanetworkopen.2023.1204.

Abstract

Importance: Many clinical trial outcomes are documented in free-text electronic health records (EHRs), making manual data collection costly and infeasible at scale. Natural language processing (NLP) is a promising approach for measuring such outcomes efficiently, but ignoring NLP-related misclassification may lead to underpowered studies.

Objective: To evaluate the performance, feasibility, and power implications of using NLP to measure the primary outcome of EHR-documented goals-of-care discussions in a pragmatic randomized clinical trial of a communication intervention.

Design, setting, and participants: This diagnostic study compared the performance, feasibility, and power implications of measuring EHR-documented goals-of-care discussions using 3 approaches: (1) deep-learning NLP, (2) NLP-screened human abstraction (manual verification of NLP-positive records), and (3) conventional manual abstraction. The study included hospitalized patients aged 55 years or older with serious illness enrolled between April 23, 2020, and March 26, 2021, in a pragmatic randomized clinical trial of a communication intervention in a multihospital US academic health system.
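The second approach, NLP-screened human abstraction, amounts to a simple routing rule: the model labels every record, and humans verify only the records the model flags as positive. A minimal sketch follows, assuming a per-participant classifier score and a hypothetical `review` callback standing in for manual verification; this is illustrative, not the trial's actual pipeline.

```python
# Sketch of NLP-alone vs NLP-screened human abstraction measurement.
# `scores` are hypothetical per-participant NLP probabilities; `review`
# is a hypothetical stand-in for a human abstractor's verdict.
from typing import Callable, Sequence

def nlp_only(scores: Sequence[float], threshold: float) -> list[bool]:
    """Accept the NLP label directly; errors in both directions remain."""
    return [s >= threshold for s in scores]

def nlp_screened(scores: Sequence[float], threshold: float,
                 review: Callable[[int], bool]) -> list[bool]:
    """Humans verify only NLP-positive records, removing false positives
    at a small fraction of full-abstraction cost; records the NLP missed
    are never reviewed and stay negative."""
    return [review(i) if s >= threshold else False
            for i, s in enumerate(scores)]
```

Under this screening design, the specificity of the combined measurement approaches 100%, while its sensitivity is capped by the NLP model's sensitivity at the chosen threshold.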

Main outcomes and measures: Main outcomes were NLP performance characteristics, human abstractor-hours, and misclassification-adjusted statistical power for each method of measuring clinician-documented goals-of-care discussions. NLP performance was evaluated with receiver operating characteristic (ROC) curves and precision-recall (PR) analyses; the effects of misclassification on power were examined using mathematical substitution and Monte Carlo simulation.
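As an illustration of how such performance characteristics are computed, the sketch below applies standard scikit-learn metrics to a hypothetical validation set, with `y_true` denoting gold-standard labels from manual abstraction and `y_score` the NLP model's predicted probabilities; it is a generic recipe under those assumptions, not the authors' evaluation code.

```python
# Generic ROC/PR evaluation of a binary NLP classifier against a
# manually abstracted gold standard (hypothetical arrays y_true, y_score).
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

def evaluate(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return {
        "auroc": roc_auc_score(y_true, y_score),            # area under ROC curve
        "auprc": average_precision_score(y_true, y_score),  # area under PR curve
        "max_f1": float(f1.max()),                          # maximal F1 across thresholds
    }
```

The maximal F1 score summarizes the best achievable precision-recall trade-off over all probability thresholds, which is why it is reported alongside the threshold-free AUROC and AUPRC.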

Results: A total of 2512 trial participants (mean [SD] age, 71.7 [10.8] years; 1456 [58%] female) accrued 44 324 clinical notes during 30-day follow-up. In a validation sample of 159 participants, deep-learning NLP trained on a separate training data set identified patients with documented goals-of-care discussions with moderate accuracy (maximal F1 score, 0.82; area under the ROC curve, 0.924; area under the PR curve, 0.879). Manual abstraction of the outcome from the trial data set would require an estimated 2000 abstractor-hours and would power the trial to detect a risk difference of 5.4% (assuming 33.5% control-arm prevalence, 80% power, and 2-sided α = .05). Measuring the outcome by NLP alone would power the trial to detect a risk difference of 7.6%. Measuring the outcome by NLP-screened human abstraction would require 34.3 abstractor-hours to achieve an estimated sensitivity of 92.6% and would power the trial to detect a risk difference of 5.7%. Monte Carlo simulations corroborated the misclassification-adjusted power calculations.
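The substitution idea behind these adjusted power figures can be made concrete. In the sketch below (an illustration under stated assumptions, not the authors' code), a classifier with sensitivity Se and specificity Sp maps a true prevalence p to an observed prevalence q = Se·p + (1 − Sp)(1 − p); power for a two-sided two-proportion z-test is computed on the observed scale, and the smallest detectable true risk difference is found by search. Plugging in roughly 1256 participants per arm, a 33.5% control prevalence, Se ≈ 92.6%, and Sp ≈ 100% (as for human-verified NLP positives) yields a detectable risk difference close to the 5.7% reported above.

```python
# Misclassification-adjusted detectable risk difference for a two-arm trial.
# Illustrative parameter values only; Se/Sp map true to observed prevalence.
from scipy.stats import norm

def observed(p: float, se: float, sp: float) -> float:
    """True prevalence -> prevalence as measured by an imperfect classifier."""
    return se * p + (1.0 - sp) * (1.0 - p)

def power_two_prop(q0: float, q1: float, n_per_arm: int,
                   alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test."""
    qbar = (q0 + q1) / 2.0
    z_a = norm.ppf(1.0 - alpha / 2.0)
    num = abs(q1 - q0) * n_per_arm ** 0.5 - z_a * (2 * qbar * (1 - qbar)) ** 0.5
    den = (q0 * (1 - q0) + q1 * (1 - q1)) ** 0.5
    return float(norm.cdf(num / den))

def detectable_rd(p0: float, n_per_arm: int, se: float, sp: float,
                  target: float = 0.80) -> float:
    """Smallest true risk difference detectable with `target` power."""
    rd = 0.001
    while power_two_prop(observed(p0, se, sp),
                         observed(p0 + rd, se, sp), n_per_arm) < target:
        rd += 0.001
    return rd

print(detectable_rd(0.335, 1256, se=0.926, sp=1.0))  # approx 0.057
```

A Monte Carlo check of the same quantity would repeatedly simulate trial outcomes, pass them through the Se/Sp misclassification, and count how often the test rejects; the abstract reports that such simulations corroborated the closed-form adjustment.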

Conclusions and relevance: In this diagnostic study, deep-learning NLP and NLP-screened human abstraction had favorable characteristics for measuring an EHR-documented outcome at scale. Misclassification-adjusted power calculations accurately quantified the power lost to NLP-related misclassification, supporting the incorporation of such calculations into the design of studies that measure outcomes with NLP.

Publication types

  • Multicenter Study
  • Pragmatic Clinical Trial
  • Randomized Controlled Trial
  • Research Support, Non-U.S. Gov't
  • Research Support, N.I.H., Extramural

MeSH terms

  • Aged
  • Clinical Trials as Topic*
  • Computer Simulation
  • Data Collection* / methods
  • Deep Learning
  • Electronic Health Records*
  • Feasibility Studies
  • Female
  • Hospitalization
  • Humans
  • Male
  • Middle Aged
  • Natural Language Processing*
  • Patient Care Planning*