Feasibility of Active Machine Learning for Multiclass Compound Classification

J Chem Inf Model. 2016 Jan 25;56(1):12-20. doi: 10.1021/acs.jcim.5b00332. Epub 2016 Jan 7.

Abstract

A common task in the hit-to-lead process is classifying sets of compounds into multiple, usually structural, classes that lay the groundwork for subsequent SAR studies. Machine learning techniques can automate this process by learning classification models from training compounds of each class. Gathering class information for compounds can be cost-intensive because the required data must be provided by human experts or experiments. This paper studies whether active machine learning can reduce the required number of training compounds. Active learning is a machine learning method that processes class label data iteratively and has gained much attention in a broad range of application areas. In this paper, an active learning method for multiclass compound classification is proposed. This method selects informative training compounds so as to optimally support the learning progress. Combined with human feedback, it yields a semiautomated, interactive multiclass classification procedure. The method was investigated empirically on 15 compound classification tasks containing 86-2870 compounds in 3-38 classes. The empirical results show that active learning can solve these classification tasks using 10-80% of the data that would be necessary for standard learning techniques.
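The iterative procedure described in the abstract can be illustrated with a minimal pool-based active learning loop. The abstract does not state which classifier or selection criterion the paper uses, so this sketch swaps in two assumptions: a nearest-centroid classifier and margin-based uncertainty sampling, which queries the item whose two nearest class centroids are almost equally distant. The `oracle` callback stands in for the human expert or experiment that supplies a class label. All function names and the 2-D toy data are hypothetical and chosen for illustration only; real compounds would be described by fingerprints or descriptors.

```python
import math
import random


def class_centroids(X, y):
    """Return the mean feature vector of each class seen in the labeled data."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(x, cents):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda c: math.dist(x, cents[c]))


def margin(x, cents):
    """Distance gap between the two nearest centroids; a small gap means uncertain."""
    d = sorted(math.dist(x, c) for c in cents.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def active_learn(pool, oracle, seed_idx, budget):
    """Ask the oracle for `budget` extra labels, always for the least certain item."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        idx = list(labeled)
        cents = class_centroids([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        query = min(unlabeled, key=lambda i: margin(pool[i], cents))
        labeled[query] = oracle(query)  # assumption: the expert answers one query per round
    idx = list(labeled)
    model = class_centroids([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Hypothetical toy pool: three well-separated 2-D clusters standing in for compound classes.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
pool, truth = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(30):
        pool.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        truth.append(label)

# Start from one labeled example per class, then request six more labels.
model, labeled = active_learn(pool, lambda i: truth[i], seed_idx=[0, 30, 60], budget=6)
accuracy = sum(predict(x, model) == t for x, t in zip(pool, truth)) / len(pool)
print(f"labels requested: {len(labeled)} of {len(pool)}, accuracy: {accuracy:.2f}")
```

In this sketch only 9 of the 90 items are ever labeled. The toy clusters are far easier than real compound classes, so the figure says nothing about the savings reported in the paper.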

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Drug Discovery / methods*
  • Feasibility Studies
  • Feedback
  • Humans
  • Supervised Machine Learning*