Prototype Selection Method Based on the Rivality and Reliability Indexes for the Improvement of the Classification Models and External Predictions

J Chem Inf Model. 2020 Jun 22;60(6):3009-3021. doi: 10.1021/acs.jcim.0c00176. Epub 2020 Apr 26.

Abstract

Prototype (or instance) selection is an important field of research in knowledge discovery, data mining, and machine learning. In quantitative structure-activity relationship (QSAR) modeling, applying prototype selection techniques in the preprocessing stage of model construction supports data set curation and improves the interpretability and accuracy of the models as well as the performance of the algorithms. In this paper, we propose an efficient prototype selection method for use in the preprocessing stage of the construction of QSAR classification models. The proposed method achieves very high reduction rates in the cardinality of the training set while maintaining or even increasing the accuracy of the classification models. The method has been validated by predicting external molecules, which shows that predictive performance on new molecules is also maintained or even improved. The method has been tested on 40 benchmark data sets of different sizes and balancing ratios; the results demonstrate its wide applicability domain.
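To make the idea of prototype selection and of a "reduction rate" concrete, the following is a minimal sketch only. It does not implement the rivality or reliability indexes, which the abstract does not define; instead it uses Hart's condensed nearest neighbor rule as a stand-in technique. The function names, the toy two-dimensional points, and the labels are all hypothetical and chosen purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def condensed_nearest_neighbor(points, labels):
    """Return sorted indices of the samples kept as prototypes.

    Stand-in technique (Hart's condensed nearest neighbor), NOT the
    rivality/reliability method of the paper: a sample is added to the
    prototype set whenever its nearest current prototype has a
    different class label.
    """
    if not points:
        return []
    keep = [0]  # assumption: seed the prototype set with the first sample
    changed = True
    while changed:  # repeat passes until no sample is misclassified
        changed = False
        for i, p in enumerate(points):
            if i in keep:
                continue
            # 1-NN classification of p using only the current prototypes
            nearest = min(keep, key=lambda j: dist(p, points[j]))
            if labels[nearest] != labels[i]:
                keep.append(i)
                changed = True
    return sorted(keep)


def reduction_rate(n_original, n_kept):
    """Fraction of the training set removed by the selection step."""
    return 1.0 - n_kept / n_original


# Hypothetical toy data: two well-separated classes in a 2D descriptor space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

kept = condensed_nearest_neighbor(points, labels)
print(kept, reduction_rate(len(points), len(kept)))
```

On this toy set one prototype per class suffices, so six of the eight samples are discarded. Real QSAR data sets are far less cleanly separated, and the reduction rates and accuracies reported for the proposed method cannot be inferred from this sketch.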

MeSH terms

  • Algorithms
  • Data Mining
  • Machine Learning*
  • Quantitative Structure-Activity Relationship*
  • Reproducibility of Results