Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries

Pankaj Kumar; Xiaohua Ma; Xianghui Liu; Jia Jia; Han Bucong; Ying Xue; Ze Rong Li; Sheng Yong Yang; Yu Quan Wei; Yu Zong Chen

doi:10.1007/s10822-011-9431-3

J Comput Aided Mol Des. 2011 May;25(5):455-67. doi: 10.1007/s10822-011-9431-3. Epub 2011 May 10.

Authors

Pankaj Kumar¹, Xiaohua Ma, Xianghui Liu, Jia Jia, Han Bucong, Ying Xue, Ze Rong Li, Sheng Yong Yang, Yu Quan Wei, Yu Zong Chen

Affiliation

¹ Bioinformatics and Drug Design Group, Centre for Computational Science and Engineering, Department of Pharmacy, National University of Singapore.

PMID: 21556903
DOI: 10.1007/s10822-011-9431-3

Abstract

Various in vitro and in-silico methods have been used for drug genotoxicity tests, which show limited genotoxicity (GT+) and non-genotoxicity (GT-) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The rates of in-silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT- compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machines (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames test only). H-SVM, trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT- compounds excluding clinical trial drugs, correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT-, and predicted 23.96% of the 168K MDDR and 27.23% of the 17.86M PubChem compounds as GT+. These rates are comparable to the 43.1-51.9% GT+ and 75-93% GT- rates of existing in-silico methods, the 58.8% GT+ and 79% GT- rates of the Ames method, and the estimated percentages of 23% in vivo and 31-33% in vitro GT+ compounds in the "universe of chemicals". There is a substantial level of agreement between the H-SVM and L-SVM predictions of GT+ and GT- MDDR compounds and the predictions from TOPKAT. SVM showed good potential for identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
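The GT+ and GT- identification rates quoted in the abstract are, in the usual reading, the fractions of truly genotoxic and truly non-genotoxic compounds that a classifier labels correctly (sensitivity and specificity). The sketch below shows that calculation only and is not the authors' code. The function name `identification_rates` and the 1/0 label encoding are assumptions made for illustration. The count of 31 correct predictions is not stated in the abstract; it is simply the only whole number out of 38 that rounds to the reported 81.6%.

```python
def identification_rates(y_true, y_pred):
    """Return (GT+ rate, GT- rate) as fractions.

    Labels are assumed to be 1 for GT+ and 0 for GT- (an illustrative
    encoding, not taken from the paper). A rate is None when the
    corresponding class is absent from y_true.
    """
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = sum(1 for t, _ in pairs if t == 0)
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    gt_pos_rate = true_pos / n_pos if n_pos else None
    gt_neg_rate = true_neg / n_neg if n_neg else None
    return gt_pos_rate, gt_neg_rate


# Illustrative test set: 38 GT+ compounds, 31 of them predicted GT+.
y_true = [1] * 38
y_pred = [1] * 31 + [0] * 7

gt_pos, gt_neg = identification_rates(y_true, y_pred)
print(f"GT+ identification rate: {gt_pos:.1%}")  # 81.6%
print(f"GT- identification rate: {gt_neg}")      # None (no GT- compounds in this set)
```

A test set containing only GT+ compounds, such as the 38 compounds reported since 2008, can measure the GT+ rate but says nothing about the GT- rate, which is why the abstract reports the GT- rate separately on the clinical trial drugs.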

MeSH terms

Artifacts
Artificial Intelligence
Computational Biology*
DNA Damage / genetics
Databases, Factual
Drug Evaluation, Preclinical / methods*
Drug-Related Side Effects and Adverse Reactions
High-Throughput Screening Assays
Models, Chemical*
Mutagenicity Tests / instrumentation*
Pharmaceutical Preparations
Small Molecule Libraries / analysis
Small Molecule Libraries / chemistry*
User-Computer Interface

Substances

Pharmaceutical Preparations
Small Molecule Libraries