The experimental uncertainty of heterogeneous public Kᵢ data

Christian Kramer; Tuomo Kalliokoski; Peter Gedeck; Anna Vulpetti

doi:10.1021/jm300131x

J Med Chem. 2012 Jun 14;55(11):5165-73. doi: 10.1021/jm300131x. Epub 2012 May 29.

Authors

Christian Kramer¹, Tuomo Kalliokoski, Peter Gedeck, Anna Vulpetti

Affiliation

¹ Novartis Institutes for BioMedical Research, Novartis Pharma AG, Forum 1, Novartis Campus, CH-4056 Basel, Switzerland. Christian.Kramer@novartis.com

PMID: 22643060

Abstract

The maximum achievable accuracy of in silico models depends on the quality of the experimental data. Consequently, experimental uncertainty defines a natural upper limit to the predictive performance possible. Models that yield errors smaller than the experimental uncertainty are necessarily overtrained. A reliable estimate of the experimental uncertainty is therefore of high importance to all originators and users of in silico models. The data deposited in ChEMBL was analyzed for reproducibility, i.e., the experimental uncertainty of independent measurements. Careful filtering of the data was required because ChEMBL contains unit-transcription errors, undifferentiated stereoisomers, and repeated citations of single measurements (90% of all pairs). The experimental uncertainty is estimated to yield a mean error of 0.44 pKᵢ units, a standard deviation of 0.54 pKᵢ units, and a median error of 0.34 pKᵢ units. The maximum possible squared Pearson correlation coefficient (R²) on large data sets is estimated to be 0.81.
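The kind of statistics quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes that each pair holds two independent measurements of the same protein-ligand system, that the measurement error is additive with equal variance for both measurements, and that a perfect model is scored against noisy measurements. The function names `pair_uncertainty` and `max_r2` and all numeric values are hypothetical.

```python
import math
import statistics


def pair_uncertainty(pairs):
    """Summarize the disagreement between paired pKi measurements.

    `pairs` is a list of (pKi_a, pKi_b) tuples (hypothetical input format).
    """
    diffs = [a - b for a, b in pairs]
    abs_diffs = [abs(d) for d in diffs]
    # The order within a pair is arbitrary, so the spread of the
    # differences is taken around zero rather than around their mean.
    sd_diff = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {
        "mean_abs_diff": statistics.fmean(abs_diffs),
        "median_abs_diff": statistics.median(abs_diffs),
        "sd_diff": sd_diff,
        # Assumption: both measurements carry independent errors of equal
        # variance, so a single measurement has sd_diff / sqrt(2).
        "sd_single": sd_diff / math.sqrt(2),
    }


def max_r2(sd_true, sd_noise):
    """Upper bound on R^2 for a perfect model scored against noisy data.

    Assumption: observed = true + independent additive noise, so the
    squared correlation between true and observed values is
    var(true) / (var(true) + var(noise)).
    """
    return sd_true ** 2 / (sd_true ** 2 + sd_noise ** 2)


if __name__ == "__main__":
    # Toy values for illustration only, not taken from ChEMBL.
    toy_pairs = [(7.0, 7.4), (6.5, 6.1), (8.2, 8.2), (5.0, 5.6)]
    print(pair_uncertainty(toy_pairs))
    print(max_r2(sd_true=1.0, sd_noise=0.5))
```

Under these assumptions the bound depends on the spread of the true activities as well as on the noise, so the same experimental uncertainty allows a higher R² on a data set that covers a wider pKᵢ range.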

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods*
Computer Simulation*
Databases, Factual / statistics & numerical data*
Drug Discovery / methods*
Ligands
Molecular Structure
Proteins / chemistry
Quantitative Structure-Activity Relationship
Reproducibility of Results
Stereoisomerism
Uncertainty*

Substances

Ligands
Proteins