A Generic Semi-Supervised and Active Learning Framework for Biomedical Text Classification

Christopher A Flores; Rodrigo Verschae

doi:10.1109/EMBC48229.2022.9871846

Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:4445-4448. doi: 10.1109/EMBC48229.2022.9871846.

PMID: 36085799

Abstract

Biomedical text classification requires training examples labeled by clinical specialists, a process that can be costly. Active learning addresses this problem by incrementally selecting the most informative unlabeled examples, which are then labeled and used to train a given classifier, thereby reducing the number of labeled samples needed. However, active learning leaves the remaining unlabeled examples unused, even though semi-supervised techniques that exploit them could improve the representativeness of the data and the discriminatory power of the classifiers. This work proposes a generic semi-supervised learning framework for improving active learning and reducing the number of labeled training examples in biomedical text classification. The framework combines manually annotated training examples selected by active learning with pseudo-labels obtained from a trained classifier. It was evaluated on three biomedical datasets containing textual information on obesity and smoking habits, using different classification algorithms. The results show that the framework can reduce the number of training examples manually labeled by clinical specialists by 10% without affecting classifier performance. This outcome is attributed to the ability of the classifiers to correctly select and label the training examples.

Clinical relevance: We demonstrate the effectiveness of the proposed semi-supervised learning framework in reducing the manual effort required of clinical specialists to label biomedical texts for training classifiers.
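The combination of actively queried labels and classifier pseudo-labels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the classifiers, the query strategy, or the pseudo-labeling rule, so the nearest-centroid classifier, least-confidence querying, the 0.9 confidence threshold, the batch size, and the synthetic 2-D points standing in for text features are all assumptions made for illustration.

```python
import math
import random


def fit_centroids(points, labels):
    """Toy nearest-centroid classifier: return {label: mean point}."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }


def predict(centroids, point):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    scores = {c: math.exp(-math.dist(point, mu)) for c, mu in centroids.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def run_framework(X, oracle, seed_idx, rounds=5, batch=2, threshold=0.9):
    """Hypothetical loop: `oracle` plays the role of the clinical specialist."""
    labeled = {i: oracle[i] for i in seed_idx}  # manually annotated examples
    pseudo = {}                                 # labels assigned by the classifier
    for _ in range(rounds):
        train = {**pseudo, **labeled}           # manual labels take precedence
        model = fit_centroids([X[i] for i in train], [train[i] for i in train])
        pool = [i for i in range(len(X)) if i not in labeled]
        preds = {i: predict(model, X[i]) for i in pool}
        # Active learning step: send the least confident examples to the oracle.
        for i in sorted(pool, key=lambda i: preds[i][1])[:batch]:
            labeled[i] = oracle[i]
        # Semi-supervised step: keep confident predictions as pseudo-labels.
        pseudo = {i: preds[i][0] for i in pool
                  if i not in labeled and preds[i][1] >= threshold}
    train = {**pseudo, **labeled}
    model = fit_centroids([X[i] for i in train], [train[i] for i in train])
    return model, labeled, pseudo


# Synthetic two-class data (assumption: stands in for vectorized clinical text).
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
X += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

model, labeled, pseudo = run_framework(X, y, seed_idx=[0, 50])
accuracy = sum(predict(model, x)[0] == t for x, t in zip(X, y)) / len(X)
print(f"manual: {len(labeled)}, pseudo: {len(pseudo)}, accuracy: {accuracy:.2f}")
```

In this sketch the threshold controls the trade-off between how many unlabeled examples contribute to training and how many incorrect pseudo-labels are admitted; a real system would likely tune it on held-out data.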

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Humans
Obesity
Problem-Based Learning*
Smoking
Supervised Machine Learning*