Regression Modelability Index: A New Index for Prediction of the Modelability of Data Sets in the Development of QSAR Regression Models

J Chem Inf Model. 2018 Oct 22;58(10):2069-2084. doi: 10.1021/acs.jcim.8b00313. Epub 2018 Sep 25.

Abstract

Predicting whether a data set can be modeled by a statistical algorithm is an important issue in the development of quantitative structure-activity relationship (QSAR) regression models, because it allows researchers to avoid unnecessary tasks, wasted time, and/or the need to curate the molecular composition of the data set in order to improve the model's accuracy. In this paper, we propose and formulate a new index that correlates with the performance of QSAR models. This index, the regression modelability index, has a very low computational cost and is based on the rivality between the nearest neighbors of the molecules in the data set. This rivality measures the capability of each molecule in the data set to be correctly predicted by a regression algorithm. In this study, using 40 data sets that differ widely in the number of molecules and in activity values, we show a high correlation between the proposed regression modelability index and the cross-validated correlation coefficient (Q²), with r² values reaching 0.8. In addition, we describe the ability of this index to identify the outliers detected by the regression algorithms, allowing easy data set curation in the first stages of the construction of QSAR regression models.
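
To make the quantities in the abstract concrete, the following is a minimal sketch of how a cross-validated Q² might be computed from nearest-neighbor predictions. It is not the regression modelability index itself, whose formula is not given in the abstract. The function names, the leave-one-out scheme, the Euclidean distance on descriptor vectors, and the toy data are all assumptions made for illustration, and Q² is taken in its common form 1 - PRESS/SS_total.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def loo_nearest_neighbor_predictions(X, y):
    """Leave-one-out prediction: each molecule receives the activity of its
    nearest *other* molecule in descriptor space. Ties go to the lowest index."""
    preds = []
    for i, xi in enumerate(X):
        others = (k for k in range(len(X)) if k != i)
        nearest = min(others, key=lambda k: euclidean(xi, X[k]))
        preds.append(y[nearest])
    return preds


def q2(y, preds):
    """Cross-validated Q2 = 1 - PRESS / SS_total, using the mean of all
    observed activities as the reference (one common convention)."""
    mean_y = sum(y) / len(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Hypothetical toy data: one descriptor per molecule and its activity value.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]

preds = loo_nearest_neighbor_predictions(X, y)
print(preds, q2(y, preds))
```

A molecule whose nearest neighbors have very different activity values contributes a large squared error to PRESS and lowers Q², which is the intuition behind judging modelability from neighbor relationships before any model is trained.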

MeSH terms

  • Algorithms
  • Computer Simulation
  • Drug Discovery / methods*
  • Models, Molecular
  • Molecular Structure
  • Quantitative Structure-Activity Relationship
  • Software*