Shattering Distribution for Active Learning

IEEE Trans Neural Netw Learn Syst. 2022 Jan;33(1):215-228. doi: 10.1109/TNNLS.2020.3027605. Epub 2022 Jan 5.

Abstract

Active learning (AL) aims to maximize the learning performance of the current hypothesis while drawing as few labels as possible from an input distribution. Most existing AL algorithms prune the hypothesis set by querying labels of unlabeled samples, and can therefore be viewed as hypothesis-pruning strategies. This process, however, depends critically on the initial hypothesis and its subsequent updates. This article presents a distribution-shattering strategy that requires no hypothesis estimation: it instead shatters the number density of the input distribution. For any hypothesis class, we halve the number density of an input distribution to obtain a shattered distribution, which characterizes any hypothesis with a lower bound on its VC dimension. Our analysis shows that sampling in a shattered distribution reduces both label complexity and error disagreement. Guided by this guarantee, we derive a Shattered Distribution-based AL (SDAL) algorithm that continuously splits the shattered distribution of an input distribution into a set of representative samples. An empirical evaluation on benchmark data sets verifies the halving and querying abilities of SDAL in real-world AL tasks with limited labels. Experiments on active querying with adversarial examples and noisy labels further support our theoretical insights into the performance disagreement between the hypothesis-pruning and distribution-shattering strategies. Our code is available at https://github.com/XiaofengCao-MachineLearning/Shattering-Distribution-for-Active-Learning.
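
The abstract only gestures at how density halving might look procedurally. The plain-Python sketch below is a loose illustration under strong assumptions, not the authors' SDAL: it greedily pairs nearest neighbors in an unlabeled pool and keeps one midpoint representative per pair, roughly halving the number density each round until a query budget is reached. The function names, the midpoint-merge rule, and the budget-based stopping criterion are all hypothetical; the authors' actual algorithm is in the repository linked above.

    # Illustrative sketch only -- NOT the authors' SDAL implementation.
    # It mimics the core idea: repeatedly halve the number density of the
    # unlabeled pool and query labels of the surviving representatives.

    import numpy as np


    def halve_once(X):
        """Greedily pair each point with its nearest unpaired neighbor and
        keep one representative (the midpoint) per pair, roughly halving
        the number density of X. The midpoint rule is an assumption."""
        n = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        paired = np.zeros(n, dtype=bool)
        reps = []
        for i in np.argsort(d.min(axis=1)):  # tightest pairs first
            if paired[i]:
                continue
            j = np.argmin(np.where(paired, np.inf, d[i]))
            if paired[j] or j == i:
                reps.append(X[i])            # no partner left; keep as-is
                paired[i] = True
                continue
            reps.append((X[i] + X[j]) / 2)   # one representative per pair
            paired[i] = paired[j] = True
        return np.asarray(reps)


    def shattered_representatives(X, budget):
        """Halve the sample set until at most `budget` representatives
        remain; these are the points whose labels would be queried."""
        reps = np.asarray(X, dtype=float)
        while len(reps) > budget:
            reps = halve_once(reps)
        return reps


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(256, 2))                  # unlabeled pool
        queries = shattered_representatives(X, budget=16)
        print(queries.shape)                           # (16, 2)

Each call to halve_once reduces the pool to about half its size, so the query budget is reached in logarithmically many rounds; this mirrors the halving intuition in the abstract but makes no claim about the paper's label-complexity guarantees.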

Publication types

  • Research Support, Non-U.S. Gov't