Comparative analysis of classification techniques for topic-based biomedical literature categorisation

Ihor Stepanov; Arsentii Ivasiuk; Oleksandr Yavorskyi; Alina Frolova

doi:10.3389/fgene.2023.1238140

Front Genet. 2023 Nov 7:14:1238140. doi: 10.3389/fgene.2023.1238140. eCollection 2023.

Authors

Ihor Stepanov^# ¹ ², Arsentii Ivasiuk^# ¹ ³, Oleksandr Yavorskyi⁴, Alina Frolova² ⁵

Affiliations

¹ Knowledgator Engineering Ltd., London, United Kingdom.
² Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine.
³ Bogomoletz Institute of Physiology, Kyiv, Ukraine.
⁴ National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine.
⁵ Department of Mathematics, Kyiv Academic University, Kyiv, Ukraine.

^# Contributed equally.

Abstract

Introduction: Scientific articles are vital sources of biomedical information, but as publication volume grows every year, processing them has become increasingly challenging. The difficulty is particularly pronounced when the task requires the expertise of highly qualified professionals. Our research focused on classifying domain-specific articles to determine whether they contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. Rapid and accurate identification of drugs that may cause this condition can prevent side effects in millions of patients.
Methods: A text classification method can help regulators, such as the FDA, identify reports of potential DILI for specific drugs much faster and at massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, and methods based on information theory and statistics. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies for handling data imbalance.
Results: Transformers achieve the best results when the class distribution and semantics of the test data match those of the training set. With imbalanced data, however, simple models based on statistics and information theory can surpass complex transformers while producing more interpretable results, which is especially important in the biomedical domain. Our results also show that neural networks achieve better results when they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution.
Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their use can be redundant, and simple statistical approaches can achieve comparable results while being much faster and more explainable. Nevertheless, we see potential in combining the two approaches. Developing new neural network architectures, loss functions and training procedures that bring stability on unbalanced data is a promising direction.
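
As an illustration of what a loss function that reflects the class distribution can look like, the sketch below implements a class-weighted binary cross-entropy in plain Python. It is not the loss used in the study, which the abstract does not specify. The inverse-frequency weighting, the function names, and the 9:1 toy data are all assumptions chosen for the example.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this common "balanced" heuristic stands in for whatever
    weighting scheme a real system would use.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean negative log-likelihood of the true class.

    probabilities: predicted probability of the positive class (label 1).
    labels: true labels, 0 or 1.
    weights: mapping from label to its class weight.
    """
    eps = 1e-12  # guards against log(0)
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probabilities, labels):
        p_true = p if y == 1 else 1.0 - p
        w = weights[y]
        total_loss += -w * math.log(max(p_true, eps))
        total_weight += w
    return total_loss / total_weight


# Hypothetical toy data: nine negative articles for every positive one.
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)

# A degenerate classifier that assigns probability 0.1 to "positive" everywhere.
always_negative = [0.1] * len(labels)

unweighted = weighted_cross_entropy(always_negative, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(always_negative, labels, weights)
print(f"unweighted loss: {unweighted:.3f}, class-weighted loss: {weighted:.3f}")
```

With these weights the rare class contributes as much to the total as the frequent one, so a classifier that ignores the minority class is penalised more heavily than under the unweighted loss.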

Keywords: DILI; LSTM; biomedical literature classification; information theory; machine learning; text mining; transformer-based methods; unbalanced data.