Large-scale investigation of weakly-supervised deep learning for the fine-grained semantic indexing of biomedical literature

Anastasios Nentidis; Thomas Chatzopoulos; Anastasia Krithara; Grigorios Tsoumakas; Georgios Paliouras

doi:10.1016/j.jbi.2023.104499

J Biomed Inform. 2023 Oct:146:104499. doi: 10.1016/j.jbi.2023.104499. Epub 2023 Sep 14.

Authors

Anastasios Nentidis¹, Thomas Chatzopoulos², Anastasia Krithara³, Grigorios Tsoumakas⁴, Georgios Paliouras³

Affiliations

¹ Institute of Informatics and Telecommunications, NCSR Demokritos, Athens, Greece; School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece. Electronic address: nentidis@csd.auth.gr.
² Institute of Informatics and Telecommunications, NCSR Demokritos, Athens, Greece; Department of Computer Engineering and Informatics, University of Patras, Patras, Greece.
³ Institute of Informatics and Telecommunications, NCSR Demokritos, Athens, Greece.
⁴ School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece.

PMID: 37714418

Abstract

Objective: Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, where several related but distinct biomedical concepts are often grouped together and treated as a single topic. This study proposes a new method for automatically refining subject annotations to the level of MeSH concepts.

Methods: Lacking labelled data, we rely on weak supervision based on the occurrence of a concept in the abstract of an article, enhanced by dictionary-based heuristics. In addition, we investigate deep learning approaches, making design choices to tackle the particular challenges of this task. The new method is evaluated in a large-scale retrospective scenario, based on concepts that have been promoted to descriptors.
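As an illustration of the concept-occurrence idea only, the following minimal Python sketch assigns a weak label to an abstract whenever one of a concept's terms appears in it. It is not the authors' implementation: the function name `weak_labels`, the `concept_terms` dictionary, and the toy concepts and abstract are all hypothetical, and the matching rule (case-insensitive, whole-phrase) is an assumption.

```python
import re


def weak_labels(abstract, concept_terms):
    """Return the concepts that have at least one term occurring in the abstract.

    concept_terms maps a concept name to a list of terms (hypothetical dictionary).
    Matching is case-insensitive and restricted to whole words or phrases.
    """
    text = abstract.lower()
    labels = set()
    for concept, terms in concept_terms.items():
        for term in terms:
            # Require non-word characters (or string edges) around the term.
            pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
            if re.search(pattern, text):
                labels.add(concept)
                break  # one matching term is enough for this concept
    return labels


# Toy data, invented for illustration only.
concept_terms = {
    "Concept A": ["alpha syndrome", "alpha disease"],
    "Concept B": ["beta syndrome"],
}
abstract = "We report three cases of Alpha Syndrome in adults."
print(weak_labels(abstract, concept_terms))
```

Labels produced this way are noisy, which is why they serve as weak supervision rather than as ground truth.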

Results: In our experiments, concept occurrence was the strongest heuristic, achieving a macro-F1 score of about 0.63 across several labels. The proposed method improved on it by more than 4 percentage points.
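For readers unfamiliar with the metric, macro-F1 in its standard definition is the unweighted mean of per-label F1 scores, so every label counts equally regardless of its frequency. The sketch below computes it from per-label counts; the function names and the counts are hypothetical and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-label F1; counts maps label -> (tp, fp, fn)."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Toy counts, invented for illustration only.
counts = {
    "label X": (8, 2, 2),  # F1 = 16 / 20 = 0.8
    "label Y": (1, 1, 3),  # F1 = 2 / 6, roughly 0.33
}
print(round(macro_f1(counts), 4))
```

A gain expressed in percentage points is an absolute difference, so an improvement of 4 percentage points over a macro-F1 of 0.63 corresponds to a score of about 0.67.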

Conclusion: The results suggest that concept occurrence is a strong heuristic for refining coarse-grained labels to the level of MeSH concepts, and that the proposed method improves on it further.

Keywords: Biomedical literature; Deep learning; Medical Subject Headings (MeSH); Semantic indexing; Weak supervision.