Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data

Health Informatics J. 2023 Jan-Mar;29(1):14604582221115667. doi: 10.1177/14604582221115667.

Abstract

Background/Objectives: Unsupervised topic models are often used to facilitate improved understanding of large unstructured clinical text datasets. In this study, we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish the concurrent, convergent, and discriminant validity of learned topic models.

Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada between 01/01/2017 and 12/31/2020.

Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector, and qualitatively interpreted how these codes could be used for external validation of the learned topic model.

Results: The DTM consisted of 382,666 documents and 2210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions).

Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.
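The Methods step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy DTM, the two hypothetical codes, the choice of K = 2 (the study used K = 50), the multiplicative-update solver, and the use of a Pearson (point-biserial) correlation as the association measure are all assumptions made for illustration, since the abstract does not name the solver or the statistic used.

```python
import numpy as np


def fit_nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (documents x terms) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius loss
    (an assumed solver; the abstract does not specify one).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps   # document-by-topic weights
    H = rng.random((k, n_terms)) + eps  # topic-by-term weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def code_topic_correlations(codes, W):
    """Pearson correlation between each Boolean code and each topic vector.

    codes: (documents x codes) Boolean array; W: (documents x topics).
    Returns a (codes x topics) matrix of correlations.
    """
    C = codes.astype(float)
    c_sd = C.std(axis=0)
    w_sd = W.std(axis=0)
    c_sd[c_sd == 0] = 1.0  # guard against constant columns
    w_sd[w_sd == 0] = 1.0
    Cz = (C - C.mean(axis=0)) / c_sd
    Wz = (W - W.mean(axis=0)) / w_sd
    return Cz.T @ Wz / C.shape[0]


# Toy DTM: documents 0-2 use terms 0-1, documents 3-5 use terms 2-3.
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 1],
    [0, 0, 1, 4],
], dtype=float)

# Two hypothetical Boolean codes, one per document group.
codes = np.array([
    [True, False],
    [True, False],
    [True, False],
    [False, True],
    [False, True],
    [False, True],
])

W, H = fit_nmf(V, k=2)
R = code_topic_correlations(codes, W)

# For each code, the topic with the largest correlation is the candidate
# topic to inspect when judging semantic agreement.
best_topic_per_code = R.argmax(axis=1)
print(np.round(R, 2))
print(best_topic_per_code)
```

Under these assumptions, ranking the codes within each column of `R` gives the codes most strongly associated with a topic, which can then be compared qualitatively with that topic's top-weighted terms in `H`.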

Keywords: ICD-9 codes; clinical text data; concurrent validity; convergent validity; discriminant validity; electronic medical record; external validation; non-negative matrix factorization; topic model.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Electronic Health Records*
  • Female
  • Humans
  • International Classification of Diseases*
  • Learning
  • Primary Health Care
  • Retrospective Studies

Grants and funding