Validation subset selections for extrapolation oriented QSPAR models

Csaba Szántai-Kis; István Kövesdi; György Kéri; László Orfi

doi:10.1023/b:modi.0000006538.99122.00

Mol Divers. 2003;7(1):37-43. doi: 10.1023/b:modi.0000006538.99122.00.

Authors

Csaba Szántai-Kis¹, István Kövesdi, György Kéri, László Orfi

Affiliation

¹ Cooperative Research Center, Semmelweis University, Pf 131, Budapest 5, Hungary, 1367. szacsa@rezso.sote.hu

PMID: 14768902

Abstract

Predictive ability is one of the most important features of QSPAR models, and it should be checked by external validation. In this work we examined three types of external validation set selection methods for their usefulness in in-silico screening. The study proceeded in three steps: 1) we generated thousands of QSPR models and stored them in 'model banks'; 2) we selected a final top model from the model banks using each of the three validation set selection methods; 3) we predicted large data sets, which we called 'chemical universe sets', and calculated the corresponding standard errors of prediction (SEPs). The models were generated from small fractions of the available water solubility data during a genetic algorithm (GA) variable subset selection procedure. The external validation sets were constructed by random selection, uniformly distributed selection, or perimeter-oriented selection. We found that the models performing best on the perimeter-oriented external validation sets usually gave the best validation results when the remaining part of the available data was overwhelmingly large, i.e., when the model had to make many extrapolations. We also compared the final top models obtained with the three external validation set selection methods on three independent 'chemical universe sets' of different sizes.
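
The three selection strategies and the SEP named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the authors' algorithms, so everything here is an assumption made for illustration: "uniformly distributed" is read as evenly spaced picks along the sorted property values, "perimeter-oriented" as the samples farthest from the descriptor-space centroid, and SEP as the root-mean-square prediction error. The function names are hypothetical, and each selector assumes the requested set size k does not exceed the number of samples.

```python
import math
import random


def random_selection(n_samples, k, seed=0):
    """Pick k validation indices uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_samples), k))


def uniform_selection(y, k):
    """Assumed reading of 'uniformly distributed': k evenly spaced picks
    along the samples ranked by property value y (requires k <= len(y))."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    if k == 1:
        return [order[len(order) // 2]]
    step = (len(order) - 1) / (k - 1)
    # int(x + 0.5) rounds half up, so positions stay distinct when step >= 1.
    return sorted(order[int(j * step + 0.5)] for j in range(k))


def perimeter_selection(X, k):
    """Assumed reading of 'perimeter-oriented': the k samples farthest
    from the centroid of the descriptor matrix X (list of rows)."""
    n, d = len(X), len(X[0])
    centroid = [sum(row[j] for row in X) / n for j in range(d)]
    dist = [math.dist(row, centroid) for row in X]
    farthest = sorted(range(n), key=lambda i: dist[i], reverse=True)[:k]
    return sorted(farthest)


def sep(y_true, y_pred):
    """SEP taken here as the root-mean-square prediction error (an assumption)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


if __name__ == "__main__":
    # Toy one-descriptor data set with a single outlying compound.
    X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
    y = [5.0, 1.0, 3.0, 2.0, 4.0]
    print("random:   ", random_selection(len(y), 2))
    print("uniform:  ", uniform_selection(y, 3))
    print("perimeter:", perimeter_selection(X, 2))
    print("SEP:      ", sep([0.0, 0.0], [2.0, 2.0]))
```

In this reading, a perimeter-oriented validation set deliberately contains the compounds at the edge of the descriptor space, so a model that scores well on it has been tested under conditions resembling extrapolation.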

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computer Simulation
Models, Theoretical*
Predictive Value of Tests
Quantitative Structure-Activity Relationship*
Reproducibility of Results