Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record

Arthritis Res Ther. 2019 Dec 30;21(1):305. doi: 10.1186/s13075-019-2092-7.

Abstract

Background: Systemic sclerosis (SSc) is a rare disease with studies limited by small sample sizes. Electronic health records (EHRs) represent a powerful tool to study patients with rare diseases such as SSc, but validated methods are needed. We developed and validated EHR-based algorithms that incorporate billing codes and clinical data to identify SSc patients in the EHR.

Methods: We used a de-identified EHR with over 3 million subjects and identified 1899 potential SSc subjects with at least 1 count of the SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes. We randomly selected 200 as a training set for chart review. A subject was a case if diagnosed with SSc by a rheumatologist, dermatologist, or pulmonologist. We selected the following algorithm components based on clinical knowledge and available data: SSc ICD-9 and ICD-10-CM codes, positive antinuclear antibody (ANA) (titer ≥ 1:80), and a keyword of Raynaud's phenomenon (RP). We performed both rule-based and machine learning techniques for algorithm development. Positive predictive values (PPVs), sensitivities, and F-scores (which account for PPVs and sensitivities) were calculated for the algorithms.

Results: PPVs were low for algorithms using only 1 count of the SSc ICD-9 code. As code counts increased, the PPVs increased. PPVs were higher for algorithms using ICD-10-CM codes versus the ICD-9 code. Adding a positive ANA and RP keyword increased the PPVs of algorithms only using ICD billing codes. Algorithms using ≥ 3 or ≥ 4 counts of the SSc ICD-9 or ICD-10-CM codes and ANA positivity had the highest PPV at 100% but a low sensitivity at 50%. The algorithm with the highest F-score of 91% was ≥ 4 counts of the ICD-9 or ICD-10-CM codes with an internally validated PPV of 90%. A machine learning method using random forests yielded an algorithm with a PPV of 84%, sensitivity of 92%, and F-score of 88%. The most important feature was RP keyword.

Conclusions: Algorithms using only ICD-9 codes did not perform well to identify SSc patients. The highest performing algorithms incorporated clinical data with billing codes. EHR-based algorithms can identify SSc patients across a healthcare system, enabling researchers to examine important outcomes.

Keywords: Algorithms; Bioinformatics; Electronic health records; Systemic sclerosis.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adult
  • Aged
  • Aged, 80 and over
  • Algorithms*
  • Databases, Factual / standards
  • Databases, Factual / statistics & numerical data
  • Electronic Health Records / standards*
  • Electronic Health Records / statistics & numerical data
  • Female
  • Humans
  • International Classification of Diseases / standards*
  • Machine Learning*
  • Male
  • Middle Aged
  • Reproducibility of Results
  • Scleroderma, Systemic / diagnosis*
  • Sensitivity and Specificity