Data Veracity of Patients and Health Consumers Reported Adverse Drug Reactions on Twitter: Key Linguistic Features, Twitter Variables, and Association Rules

Tianchu Lyu; Andrew Eidson; Jungmi Jun; Xiajie Zhou; Xiang Cui; Chen Liang

doi:10.3233/SHTI220138

Stud Health Technol Inform. 2022 Jun 6;290:552-556. doi: 10.3233/SHTI220138.

Authors

Tianchu Lyu¹, Andrew Eidson², Jungmi Jun³, Xiajie Zhou⁴, Xiang Cui⁵, Chen Liang¹

Affiliations

¹ Department of Health Services Policy and Management, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA.
² School of Medicine, University of South Carolina, Columbia, South Carolina, USA.
³ School of Journalism and Mass Communications, College of Information and Communications, University of South Carolina, Columbia, South Carolina, USA.
⁴ Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China.
⁵ Department of Epidemiology, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China.

PMID: 35673077
DOI: 10.3233/SHTI220138

Abstract

As Twitter has emerged as an important data source for pharmacovigilance, the heterogeneous veracity of its data has become a major concern for extracted adverse drug reactions (ADRs). Our objective is to categorize different levels of data veracity and to explore linguistic features of tweets and Twitter variables that may be used to automatically screen for high-veracity tweets containing ADR-related information. We annotated a published Twitter corpus with linguistic features drawn from existing studies and from clinical experts. Multinomial logistic regression models found that the use of first-person pronouns, the expression of negative sentiment, and the appearance of the ADR and the drug name in the same sentence were significantly associated with higher levels of data veracity (p<0.05); that the use of medical terminology and fewer indications were associated with good data veracity (p<0.05); and that a smaller number of drugs was marginally associated with good data veracity (p=0.053). These findings suggest opportunities for developing machine learning models that automatically screen ADR-related tweets using key linguistic features, Twitter variables, and association rules.
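As a rough illustration of how a few of the linguistic features named in the abstract might be derived from raw tweet text, the following minimal Python sketch flags first-person pronouns, counts drug mentions, and checks whether a drug name and an ADR term occur in the same sentence. The word lists, the function names, and the naive sentence splitter are all assumptions made for illustration; they are not the annotation scheme or lexicons used in the study.

```python
import re

# Hypothetical mini-lexicons for illustration only; not taken from the study.
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}
DRUG_TERMS = {"ibuprofen", "metformin"}
ADR_TERMS = {"nausea", "headache", "dizzy"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive sentence split on '.', '!' and '?' (an assumption, not a real parser)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_features(tweet):
    """Return a small dict of illustrative linguistic features for one tweet."""
    tokens = set(tokenize(tweet))
    # True if any single sentence contains both a drug term and an ADR term.
    same_sentence = any(
        set(tokenize(s)) & DRUG_TERMS and set(tokenize(s)) & ADR_TERMS
        for s in split_sentences(tweet)
    )
    return {
        "first_person": bool(tokens & FIRST_PERSON),
        "drug_count": len(tokens & DRUG_TERMS),  # distinct drugs mentioned
        "adr_drug_same_sentence": same_sentence,
    }
```

Features of this kind could then serve as predictors in a multinomial logistic regression or as items for association rule mining; a real pipeline would need curated drug and ADR vocabularies and a sentiment measure, which this sketch omits.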

Keywords: Data Mining; Drug-Related Side Effects and Adverse Reactions; Pharmacovigilance.

MeSH terms

Drug-Related Side Effects and Adverse Reactions* / epidemiology
Humans
Linguistics
Machine Learning
Pharmacovigilance
Social Media*