Analyzing Suicide Risk From Linguistic Features in Social Media: Evaluation Study

JMIR Form Res. 2022 Aug 30;6(8):e35563. doi: 10.2196/35563.

Abstract

Background: Effective suicide risk assessments and interventions are vital for suicide prevention. Although assessing such risks is best done by health care professionals, people experiencing suicidal ideation may not seek help. Hence, machine learning (ML) and computational linguistics can provide analytical tools for understanding and assessing these risks, thereby facilitating suicide intervention and prevention.

Objective: This study aims to explore, using statistical analyses and ML, whether computerized language analysis could be applied to assess and better understand a person's suicide risk on social media.

Methods: We used the University of Maryland Suicidality Dataset comprising text posts written by users (N=866) of mental health-related forums on Reddit. Each user was assigned a suicide risk rating (no, low, moderate, or severe) by either medical experts or crowdsourced annotators, denoting their estimated likelihood of dying by suicide. In the language analysis, the Linguistic Inquiry and Word Count (LIWC) lexicon assessed sentiment, thinking styles, and parts of speech, whereas readability was explored using the TextStat library. The Mann-Whitney U test identified differences between at-risk (low, moderate, and severe risk) and no-risk users. Meanwhile, the Kruskal-Wallis test and the Spearman correlation coefficient were used for granular analysis between risk levels and to identify redundancy, respectively. In the ML experiments, gradient boost, random forest, and support vector machine models were trained using 10-fold cross-validation. The area under the receiver operating characteristic curve (AUC) and the F1-score were the primary measures. Finally, permutation importance uncovered the features that contributed the most to each model's decision-making.
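A minimal sketch of this pipeline in Python (scipy and scikit-learn), assuming the LIWC and TextStat features have already been extracted into a per-user feature matrix; the feature values, labels, and model settings below are hypothetical stand-ins, not the study's exact configuration:

```python
# Sketch of the analysis pipeline. The feature matrix, labels, and model
# settings are hypothetical placeholders; the study's exact LIWC features
# and hyperparameters are not reproduced here.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu, spearmanr
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_users = 866
X = rng.normal(size=(n_users, 5))        # stand-in for per-user LIWC/TextStat features
risk = rng.integers(0, 4, size=n_users)  # 0=no, 1=low, 2=moderate, 3=severe
at_risk = (risk > 0).astype(int)         # binary at-risk vs no-risk label

# Mann-Whitney U test: at-risk vs no-risk users on a single feature.
u_stat, p_mw = mannwhitneyu(X[at_risk == 1, 0], X[at_risk == 0, 0])

# Kruskal-Wallis test: granular comparison across the four risk levels.
h_stat, p_kw = kruskal(*(X[risk == level, 0] for level in range(4)))

# Spearman correlation between two features to flag redundancy.
rho, p_rho = spearmanr(X[:, 0], X[:, 1])

# Train the three classifiers with 10-fold cross-validation,
# scoring on ROC AUC and F1.
models = {
    "gradient boost": GradientBoostingClassifier(),
    "random forest": RandomForestClassifier(),
    "svm": SVC(),
}
for name, model in models.items():
    scores = cross_validate(model, X, at_risk, cv=10, scoring=("roc_auc", "f1"))
    print(name, scores["test_roc_auc"].mean(), scores["test_f1"].mean())

# Permutation importance: which features drive a fitted model's decisions.
X_tr, X_te, y_tr, y_te = train_test_split(X, at_risk, random_state=0)
forest = RandomForestClassifier().fit(X_tr, y_tr)
importances = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(importances.importances_mean)
```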

Results: Statistically significant differences (P<.05) were identified between the at-risk (671/866, 77.5%) and no-risk (195/866, 22.5%) groups. This was true for both the crowd- and expert-annotated samples. Overall, at-risk users had higher median values for most variables (authenticity, first-person pronouns, and negation), with the notable exception of clout, which indicated that at-risk users were less likely to engage in social posturing. A high positive correlation (ρ>0.84) was present between the part-of-speech variables, which implied redundancy and demonstrated the utility of aggregate features. All ML models performed similarly in AUC (0.66-0.68); however, the random forest and gradient boost models achieved noticeably better F1-scores (0.65 and 0.62, respectively) than the support vector machine (0.52). The features that contributed the most to the ML models were authenticity, clout, and negative emotions.
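To illustrate the redundancy finding, a sketch of collapsing two highly correlated features into a single aggregate; the column names and data are hypothetical, and the cutoff mirrors the ρ>0.84 correlation reported above:

```python
# Sketch: merge highly correlated part-of-speech features into an aggregate.
# Column names and data are hypothetical; the rho > 0.84 cutoff mirrors the
# correlation reported in the Results.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
base = rng.normal(size=500)
df = pd.DataFrame({
    "pronoun": base + rng.normal(scale=0.1, size=500),  # near-duplicate features
    "verb": base + rng.normal(scale=0.1, size=500),
    "clout": rng.normal(size=500),                      # independent feature
})

rho, _ = spearmanr(df["pronoun"], df["verb"])
if abs(rho) > 0.84:
    # Replace the redundant pair with a single aggregate feature.
    df["pos_aggregate"] = df[["pronoun", "verb"]].mean(axis=1)
    df = df.drop(columns=["pronoun", "verb"])
print(df.columns.tolist())
```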

Conclusions: In summary, our statistical analyses identified linguistic features associated with suicide risk, such as social posturing (eg, authenticity and clout), first-person singular pronouns, and negation. This increased our understanding of the behavioral and thought patterns of social media users and provided insights into the mechanisms behind the ML models. We also demonstrated the practical potential of ML in assisting health care professionals to assess and manage individuals at risk of suicide.

Keywords: evaluation study; interdisciplinary research; linguistics; machine learning; mental health; natural language processing; social media; suicide risk.