Active semi-supervised learning for biological data classification

PLoS One. 2020 Aug 19;15(8):e0237428. doi: 10.1371/journal.pone.0237428. eCollection 2020.

Abstract

Due to datasets have continuously grown, efforts have been performed in the attempt to solve the problem related to the large amount of unlabeled data in disproportion to the scarcity of labeled data. Another important issue is related to the trade-off between the difficulty in obtaining annotations provided by a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, active learning techniques jointly with semi-supervised learning are interesting. A smaller number of more informative samples previously selected (by the active learning strategy) and labeled by a specialist can propagate the labels to a set of unlabeled data (through the semi-supervised one). However, most of the literature works neglect the need for interactive response times that can be required by certain real applications. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. An extensive experimental evaluation was performed in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art literature works and different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. From the obtained results, we can observe the benefits of our framework, which allows the classifier to achieve higher accuracies more quickly with a reduced number of annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, enables the prioritization of the most informative boundary samples for the learning process. We obtained a gain of up to 20% against other learning techniques. The active semi-supervised learning approaches presented a better trade-off (accuracies and competitive and viable computational times) when compared with the active supervised learning ones.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Management / methods*
  • Supervised Machine Learning*

Grants and funding

This work was supported by National Council for Scientific and Technological Development - CNPq, http://www.cnpq.br/, grant numbers 431668/2016-7 (PTMS), 422811/2016-5 (PHB); Coordination for the Improvement of Higher Education Personnel - CAPES, http://www.capes.gov.br/, (GC); Fundação Araucária; SETI; and UTFPR. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.