Data Veracity of Patients and Health Consumers Reported Adverse Drug Reactions on Twitter: Key Linguistic Features, Twitter Variables, and Association Rules

Stud Health Technol Inform. 2022 Jun 6:290:552-556. doi: 10.3233/SHTI220138.

Abstract

As Twitter has emerged as an important data source for pharmacovigilance, the heterogeneous veracity of its data has become a major concern for extracted adverse drug reactions (ADRs). Our objective was to categorize different levels of data veracity and to explore linguistic features of tweets and Twitter variables that may be used to automatically screen for high-veracity tweets containing ADR-related information. We annotated a published Twitter corpus with linguistic features drawn from existing studies and from clinical experts. Multinomial logistic regression models found that the use of first-person pronouns, the expression of negative sentiment, and the ADR and drug name appearing in the same sentence were significantly associated with higher levels of data veracity (p<0.05); that the use of medical terminology and fewer indications were associated with good data veracity (p<0.05); and that a smaller number of drugs was marginally associated with good data veracity (p=0.053). These findings suggest opportunities for developing machine learning models that automatically screen ADR-related tweets using key linguistic features, Twitter variables, and association rules.

Keywords: Data Mining; Drug-Related Side Effects and Adverse Reactions; Pharmacovigilance.

MeSH terms

  • Drug-Related Side Effects and Adverse Reactions* / epidemiology
  • Humans
  • Linguistics
  • Machine Learning
  • Pharmacovigilance
  • Social Media*