PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark

Proc ACM Int Conf Inf Knowl Manag. 2022 Oct:2022:4470-4474. doi: 10.1145/3511808.3557675. Epub 2022 Oct 17.

Abstract

As the volume of biomedical literature grows, accurate keyword search becomes increasingly important for reproducible research. However, keyword extraction for biomedical articles is difficult because many keywords are obscure and no comprehensive benchmark exists. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of 843,269 articles from the PubMed open access subset. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark and has enough samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.

Keywords: PubMed literature; datasets; keyphrases extraction; keywords extraction.
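
As a rough illustration of the record layout described in the abstract (title, abstract, author-assigned keywords), the sketch below models one record and scores a set of predicted keywords against the author-assigned ones. The `Record` class, the field names, the exact-match F1 metric, the cutoff `k`, and the sample record are all assumptions made for illustration; the abstract does not specify the dataset schema or the evaluation protocol used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical in-memory form of one benchmark entry (field names assumed)."""
    title: str
    abstract: str
    keywords: List[str]


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore trivial differences."""
    return " ".join(phrase.lower().split())


def present_keywords(record: Record) -> List[str]:
    """Return the author-assigned keywords that literally occur in title + abstract."""
    text = normalize(record.title + " " + record.abstract)
    return [kw for kw in record.keywords if normalize(kw) in text]


def prf_at_k(predicted: List[str], gold: List[str], k: int = 10) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over the top-k predictions (assumed metric)."""
    top: List[str] = []
    for phrase in predicted:
        norm = normalize(phrase)
        if norm and norm not in top:  # drop empty strings and duplicates, keep rank order
            top.append(norm)
    top = top[:k]
    gold_set = {normalize(g) for g in gold if normalize(g)}
    if not top or not gold_set:
        return 0.0, 0.0, 0.0
    hits = sum(1 for phrase in top if phrase in gold_set)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(top)
    recall = hits / len(gold_set)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example record, not taken from the dataset.
record = Record(
    title="Deep learning for sepsis prediction",
    abstract="We train a recurrent neural network on ICU data to predict sepsis onset.",
    keywords=["sepsis", "recurrent neural network", "critical care"],
)
predicted = ["Sepsis", "ICU data", "recurrent neural network", "deep learning"]

present = present_keywords(record)
precision, recall, f1 = prf_at_k(predicted, record.keywords)
print(present, precision, recall, f1)
```

In this made-up record, "critical care" never appears in the title or abstract, which is one way an author-assigned keyword can be hard to recover by extraction alone.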