QSAR modeling of imbalanced high-throughput screening data in PubChem

J Chem Inf Model. 2014 Mar 24;54(3):705-12. doi: 10.1021/ci400737s. Epub 2014 Feb 28.

Abstract

Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).
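
The abstract does not say which balancing strategies or evaluation metrics were used, so the following is only an illustrative sketch of why imbalance matters, not the authors' method. It assumes a hypothetical assay with 10 actives among 1000 compounds, shows that a trivial "everything is inactive" predictor scores high plain accuracy but only chance-level balanced accuracy, and demonstrates random undersampling of inactives as one common generic remedy. All function names and the 1:1 ratio are assumptions made for this example.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def undersample_inactives(labels, ratio=1, seed=0):
    """Keep every active plus `ratio` randomly chosen inactives per active.

    Returns the sorted indices of the retained compounds.
    """
    rng = random.Random(seed)
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(inactives), ratio * len(actives))
    return sorted(actives + rng.sample(inactives, n_keep))

# Hypothetical assay: 10 actives, 990 inactives (made-up numbers).
labels = [1] * 10 + [0] * 990
all_inactive = [0] * len(labels)  # trivial predictor: nothing is active

acc = accuracy(labels, all_inactive)
bacc = balanced_accuracy(labels, all_inactive)
kept = undersample_inactives(labels, ratio=1)

print(f"accuracy={acc:.2f}, balanced accuracy={bacc:.2f}")
print(f"training set after undersampling: {len(kept)} compounds")
```

Undersampling discards most of the inactive compounds, which is a trade-off; whichever balancing scheme is chosen, evaluation on external test and validation sets should still reflect the original class ratio.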

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Databases, Chemical
  • Drug Evaluation, Preclinical / methods*
  • HEK293 Cells
  • High-Throughput Screening Assays / methods*
  • Humans
  • Models, Biological
  • Quantitative Structure-Activity Relationship*
  • Small Molecule Libraries / chemistry*
  • Small Molecule Libraries / pharmacology*
  • Software

Substances

  • Small Molecule Libraries