The Classification of Short Scientific Texts Using Pretrained BERT Model

Gleb Danilov; Timur Ishankulov; Konstantin Kotik; Yuriy Orlov; Mikhail Shifrin; Alexander Potapov

doi:10.3233/SHTI210125

Stud Health Technol Inform. 2021 May 27:281:83-87. doi: 10.3233/SHTI210125.

Authors

Gleb Danilov¹, Timur Ishankulov¹, Konstantin Kotik¹, Yuriy Orlov², Mikhail Shifrin¹, Alexander Potapov¹

Affiliations

¹ Laboratory of Biomedical Informatics and Artificial Intelligence, National Medical Research Center for Neurosurgery named after N.N. Burdenko, Moscow, Russian Federation.
² Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Moscow, Russian Federation.

PMID: 34042710
DOI: 10.3233/SHTI210125

Abstract

Automated text classification is a natural language processing (NLP) technology that could significantly facilitate the selection of scientific literature. A topic-specific dataset of 630 article abstracts was obtained from the PubMed database. We proposed 27 parametrized configurations of the PubMedBERT model and 4 ensemble models to solve a binary classification task on that dataset. Three hundred tests with resampling were performed for each classification approach. The best PubMedBERT model achieved an F1-score of 0.857, while the best ensemble model reached an F1-score of 0.853. We concluded that the quality of short scientific text classification might be improved using the latest state-of-the-art approaches.
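
The evaluation described above (repeated tests with resampling, scored by F1) can be illustrated with a minimal sketch. The abstract does not state the resampling scheme, so random train/test splits are assumed here; the names `f1_score`, `resampled_f1`, `fit_predict`, and `keyword_rule`, as well as the `test_fraction` and `seed` parameters and the toy data, are hypothetical and not taken from the paper. The `fit_predict` callback stands in for any classifier, such as a fine-tuned PubMedBERT model.

```python
import random
from statistics import mean


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for the class labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def resampled_f1(texts, labels, fit_predict, n_runs=300, test_fraction=0.2, seed=0):
    """Average F1 over n_runs random train/test splits.

    Assumption: random splitting is used as the resampling scheme.
    fit_predict(train_texts, train_labels, test_texts) must return a list
    of 0/1 predictions, one per test text.
    """
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    n_test = max(1, int(len(texts) * test_fraction))
    scores = []
    for _ in range(n_runs):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        y_pred = fit_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        scores.append(f1_score([labels[i] for i in test_idx], y_pred))
    return mean(scores)


def keyword_rule(train_texts, train_labels, test_texts):
    """Hypothetical stand-in classifier: ignores the training data entirely."""
    return [1 if "glioma" in text.lower() else 0 for text in test_texts]


if __name__ == "__main__":
    # Toy data for illustration only, not the 630-abstract dataset.
    texts = ["Glioma resection outcomes", "Cardiac imaging review"] * 10
    labels = [1, 0] * 10
    print(resampled_f1(texts, labels, keyword_rule, n_runs=300))
```

Averaging over many resamples, rather than reporting a single split, reduces the dependence of the reported score on one particular partition, which matters for a dataset of only a few hundred abstracts.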

Keywords: Text classification; artificial intelligence; machine learning; natural language processing; neurosurgery; topic modeling.

MeSH terms

Natural Language Processing*
PubMed