Use of Machine Learning Tools in Evidence Synthesis of Tobacco Use Among Sexual and Gender Diverse Populations: Algorithm Development and Validation

JMIR Form Res. 2024 Jan 24:8:e49031. doi: 10.2196/49031.

Authors

Shaoying Ma^#¹, Shuning Jiang^#², Olivia Yang², Xuanzhi Zhang², Yu Fu², Yusen Zhang², Aadeeba Kaareen¹, Meng Ling², Jian Chen², Ce Shang¹

Affiliations

¹ Center for Tobacco Research, The Ohio State University Comprehensive Cancer Center, Columbus, OH, United States.
² Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, United States.

^# Contributed equally.

PMID: 38265858
PMCID: PMC10851114
DOI: 10.2196/49031

Abstract

Background: From 2016 to 2021, the volume of peer-reviewed publications related to tobacco increased significantly. This growth makes it difficult to summarize, synthesize, and disseminate research findings efficiently, especially for specific target populations such as LGBTQ+ (lesbian, gay, bisexual, transgender, queer, intersex, asexual, Two Spirit, and other persons who identify as part of this community) populations.

Objective: To expedite evidence synthesis and the discovery of research gaps, this pilot study has 3 aims: (1) to compile a specialized semantic database for tobacco policy research to extract information from journal article abstracts, (2) to develop natural language processing (NLP) algorithms that comprehend the literature on nicotine and tobacco product use among sexual and gender diverse populations, and (3) to compare the discoveries of the NLP algorithms with an ongoing systematic review of tobacco policy research among LGBTQ+ populations.

Methods: We built a tobacco research domain-specific semantic database using data from 2993 paper abstracts from 4 leading tobacco-specific journals, with enrichment from other publicly available sources. We then trained an NLP model to extract named entities after learning patterns and relationships between words and their context in text, which further enriched the semantic database. Using this iterative process, we extracted and assessed studies relevant to LGBTQ+ tobacco control issues, further comparing our findings with an ongoing systematic review that also focuses on evidence synthesis for this demographic group.
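
Illustrative sketch (not the authors' implementation): the abstract does not specify the model or the database schema, so the example below swaps the trained NLP model for plain dictionary matching over a tiny, hypothetical lexicon. It only shows the general idea of tagging abstracts with entity labels and then selecting those that mention a sexual and gender diverse population. The names LEXICON, build_patterns, tag_entities, and relevant_ids, the entity labels, and the sample abstracts are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon standing in for a domain-specific semantic database.
# The labels and terms are illustrative assumptions, not the study's vocabulary.
LEXICON = {
    "POPULATION": ["lgbtq+", "lesbian", "gay", "bisexual", "transgender", "sexual minority"],
    "PRODUCT": ["cigarette", "e-cigarette", "smokeless tobacco"],
    "POLICY": ["tax", "flavor ban", "smoke-free law"],
}


def build_patterns(lexicon):
    """Compile one case-insensitive regex per entity label."""
    patterns = {}
    for label, terms in lexicon.items():
        # Longest terms first so "e-cigarette" is tried before "cigarette".
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        # Group 1 captures the lexicon term; an optional plural "s" is allowed.
        patterns[label] = re.compile(r"(?<!\w)(%s)s?(?!\w)" % alternation, re.IGNORECASE)
    return patterns


def tag_entities(text, patterns):
    """Return {label: set of matched lexicon terms} for one abstract."""
    found = defaultdict(set)
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            found[label].add(match.group(1).lower())
    return dict(found)


def relevant_ids(abstracts, patterns, required_label="POPULATION"):
    """Keep the IDs of abstracts that mention at least one required entity."""
    return [
        abstract_id
        for abstract_id, text in abstracts.items()
        if required_label in tag_entities(text, patterns)
    ]


# Made-up sample abstracts, for demonstration only.
SAMPLE_ABSTRACTS = {
    "a1": "Cigarette smoking among transgender adults after a flavor ban.",
    "a2": "E-cigarettes and taxes in the general population.",
}

PATTERNS = build_patterns(LEXICON)
print(relevant_ids(SAMPLE_ABSTRACTS, PATTERNS))
```

In an iterative workflow like the one described above, terms surfaced by a trained model could be added back into the lexicon before the next pass. Dictionary matching alone misses the context-dependent mentions that a trained model is meant to capture.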

Results: In total, 33 studies were identified as relevant to sexual and gender diverse individuals' nicotine and tobacco product use. Consistent with the ongoing systematic review, the NLP results showed that there is a scarcity of studies assessing policy impact on this demographic using causal inference methods. In addition, the literature is dominated by US data. We found that the product drawing the most attention in the body of existing research is cigarettes or cigarette smoking and that the number of studies of various age groups is almost evenly distributed between youth or young adults and adults, consistent with the research needs identified by the US health agencies.

Conclusions: Our pilot study serves as a compelling demonstration of the capabilities of NLP tools in expediting the processes of evidence synthesis and the identification of research gaps. While future research is needed to statistically test the NLP tool's performance, there is potential for NLP tools to fundamentally transform the approach to evidence synthesis.

Keywords: LGBTQ+; bisexual; evidence synthesis; gay; lesbian; machine learning; natural language processing; queer; sexual and gender diverse populations; tobacco control; transgender.

©Shaoying Ma, Shuning Jiang, Olivia Yang, Xuanzhi Zhang, Yu Fu, Yusen Zhang, Aadeeba Kaareen, Meng Ling, Jian Chen, Ce Shang. Originally published in JMIR Formative Research (https://formative.jmir.org), 24.01.2024.

Grants and funding

R21 CA249757/CA/NCI NIH HHS/United States