Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries

J Comput Aided Mol Des. 2011 May;25(5):455-67. doi: 10.1007/s10822-011-9431-3. Epub 2011 May 10.

Abstract

Various in vitro and in-silico methods have been used for drug genotoxicity testing, but they show limited identification rates for genotoxic (GT+) and non-genotoxic (GT-) compounds. New methods and combinatorial approaches have been explored to enhance collective identification capability. The rates of in-silico methods may be further improved by significantly diversified training data enriched with the large number of recently reported GT+ and GT- compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machine (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs, trained on data of different diversity and noise levels, were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperformed L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames tests only). H-SVM, trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT- compounds excluding clinical trial drugs, correctly identified 81.6% of the 38 GT+ compounds reported since 2008 and predicted 83.1% of the 2,008 clinical trial drugs as GT-, 23.96% of the 168 K MDDR compounds as GT+, and 27.23% of the 17.86 M PubChem compounds as GT+. These rates are comparable to the 43.1-51.9% GT+ and 75-93% GT- rates of existing in-silico methods, the 58.8% GT+ and 79% GT- rates of the Ames method, and the estimated percentages of 23% in vivo and 31-33% in vitro GT+ compounds in the "universe of chemicals". There is substantial agreement between the GT+ and GT- MDDR compounds predicted by H-SVM and L-SVM and the predictions from TOPKAT. SVM showed good potential for identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
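
To give a sense of scale, the percentages reported in the abstract can be converted into approximate compound counts. The following is a minimal sketch, not part of the original study: the function name `approx_count` is hypothetical, the counts are obtained by simple rounding and are not reported in the abstract, and "168 K" and "17.86M" are assumed to mean 168,000 and 17,860,000 compounds.

```python
def approx_count(rate_percent: float, total: int) -> int:
    """Convert a reported percentage of `total` into a rounded compound count."""
    return round(rate_percent / 100 * total)


# (description, reported rate in %, set size) as stated in the abstract.
# Library sizes for MDDR and PubChem are assumed from "168 K" and "17.86M".
REPORTED = [
    ("GT+ compounds reported since 2008, identified as GT+", 81.6, 38),
    ("clinical trial drugs predicted as GT-", 83.1, 2_008),
    ("MDDR compounds predicted as GT+", 23.96, 168_000),
    ("PubChem compounds predicted as GT+", 27.23, 17_860_000),
]

for description, rate, total in REPORTED:
    # Derived estimate only; the abstract itself gives percentages, not counts.
    print(f"{description}: ~{approx_count(rate, total):,} of {total:,}")
```

Under these assumptions, 81.6% of the 38 recently reported GT+ compounds corresponds to about 31 compounds. The library-scale figures are order-of-magnitude estimates, since the library sizes themselves are rounded in the abstract.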

MeSH terms

  • Artifacts
  • Artificial Intelligence
  • Computational Biology*
  • DNA Damage / genetics
  • Databases, Factual
  • Drug Evaluation, Preclinical / methods*
  • Drug-Related Side Effects and Adverse Reactions
  • High-Throughput Screening Assays
  • Models, Chemical*
  • Mutagenicity Tests / instrumentation*
  • Pharmaceutical Preparations
  • Small Molecule Libraries / analysis
  • Small Molecule Libraries / chemistry*
  • User-Computer Interface

Substances

  • Pharmaceutical Preparations
  • Small Molecule Libraries