QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors

J Chem Inf Model. 2015 Jul 27;55(7):1388-99. doi: 10.1021/acs.jcim.5b00019. Epub 2015 Jun 29.

Abstract

Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we address the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases for QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two databases, one commercial and one freely available: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity, and the current severe limitations of these databases for this purpose. One such limitation is the general lack of complete and semantic/computer-parsable descriptions of assay methodology in these databases, descriptions that would allow one to determine the mix-and-matchability of result sets at the assay level.
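The two ideas at the center of the abstract, inconsistency (diverging activity values for the same compound against the same target) and compiling modeling sets per assay rather than per target, can be illustrated with a minimal Python sketch. This is not the authors' procedure. The record layout, the identifiers, the activity values, and the 1.0 log-unit spread threshold are all hypothetical assumptions chosen for illustration, and nothing here is taken from ChEMBL or Integrity.

```python
from collections import defaultdict

# Hypothetical records: (compound_id, target, assay_id, pIC50).
# All identifiers and values are invented for illustration only.
RECORDS = [
    ("cmpd_A", "HIV-1 RT", "assay_1", 7.2),
    ("cmpd_A", "HIV-1 RT", "assay_2", 5.1),
    ("cmpd_B", "HIV-1 RT", "assay_1", 6.0),
    ("cmpd_B", "HIV-1 RT", "assay_2", 6.3),
    ("cmpd_C", "HIV-1 RT", "assay_1", 8.4),
]


def find_inconsistent(records, max_spread=1.0):
    """Map (compound, target) to its activity spread when the spread
    (max minus min reported value) exceeds max_spread.

    max_spread is an assumed threshold, not a value from the paper.
    """
    values = defaultdict(list)
    for compound, target, _assay, activity in records:
        values[(compound, target)].append(activity)
    return {
        key: max(vals) - min(vals)
        for key, vals in values.items()
        if max(vals) - min(vals) > max_spread
    }


def group_by_assay(records):
    """Split records into one candidate modeling set per (target, assay),
    instead of pooling every result reported for a target."""
    sets = defaultdict(list)
    for compound, target, assay, activity in records:
        sets[(target, assay)].append((compound, activity))
    return dict(sets)


if __name__ == "__main__":
    for key, spread in find_inconsistent(RECORDS).items():
        print(key, round(spread, 2))
    for key, members in group_by_assay(RECORDS).items():
        print(key, len(members))
```

In this toy data only cmpd_A is flagged, because its two reported values differ by more than the assumed threshold. Pooling by target would place both of those values in the same training set, whereas grouping by assay keeps them in separate sets.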

Publication types

  • Research Support, N.I.H., Intramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Databases, Pharmaceutical*
  • Drug Discovery
  • Drug Resistance, Viral
  • HIV Reverse Transcriptase / antagonists & inhibitors*
  • HIV-1 / drug effects
  • HIV-1 / enzymology*
  • Models, Statistical*
  • Quantitative Structure-Activity Relationship*
  • Reverse Transcriptase Inhibitors / chemistry*
  • Reverse Transcriptase Inhibitors / pharmacology*

Substances

  • Reverse Transcriptase Inhibitors
  • reverse transcriptase, Human immunodeficiency virus 1
  • HIV Reverse Transcriptase