Undersampling: case studies of flaviviral inhibitory activities

J Comput Aided Mol Des. 2019 Nov;33(11):997-1008. doi: 10.1007/s10822-019-00255-3. Epub 2019 Nov 26.

Abstract

Imbalanced datasets, comprising more inactive than active compounds, are a common challenge in ligand-based model building workflows for drug discovery. This is particularly true for neglected tropical diseases, since efforts to identify therapeutics for these diseases are often limited. In this report, we analyze the performance of several undersampling strategies in modeling Dengue virus 2 (DENV2) inhibitory activity, as well as anti-flaviviral activity against the West Nile (WNV) and Zika (ZIKV) viruses. To this end, we build datasets comprising 1218 (159 actives and 1059 inactives), 1044 (132 actives and 912 inactives) and 302 (75 actives and 227 inactives) molecules with known DENV2, WNV and ZIKV inhibitory activity profiles, respectively. We develop ensemble classifiers for these endpoints and compare the performance of the different undersampling algorithms on external sets. Data pruning algorithms yield superior performance relative to data selection algorithms. The best overall performance is provided by the one-sided selection algorithm, with test set balanced accuracy (BACC) values of 0.84, 0.74 and 0.77 for the DENV2, WNV and ZIKV inhibitory activities, respectively. For model building, we use the recently proposed GT-STAF information indices and compare the predictivity of three molecular fragmentation approaches: connected subgraphs, substructure and AlogP atom types, which show comparable performance. However, a combination of indices based on these fragmentation strategies enhances the predictivity of the built ensembles. The built models could be useful for screening new molecules with possible DENV, WNV and ZIKV inhibitory activities. ADMET modelers are encouraged to adopt undersampling algorithms in their workflows when dealing with imbalanced datasets.
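The class counts reported above show why balanced accuracy, rather than plain accuracy, is the more informative metric here. The following is a minimal Python sketch, not the authors' code. It assumes the usual definition of BACC as the mean of sensitivity and specificity (the abstract does not spell out the formula), and the "predict everything inactive" baseline is a hypothetical classifier introduced only for illustration. The function names are likewise illustrative.

```python
from typing import Dict, Tuple

# (actives, inactives) per endpoint, as reported in the abstract.
DATASETS: Dict[str, Tuple[int, int]] = {
    "DENV2": (159, 1059),
    "WNV": (132, 912),
    "ZIKV": (75, 227),
}


def imbalance_ratio(n_active: int, n_inactive: int) -> float:
    """Number of inactive compounds per active compound."""
    return n_inactive / n_active


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all compounds classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity (assumed BACC definition)."""
    sensitivity = tp / (tp + fn)  # recall on the active class
    specificity = tn / (tn + fp)  # recall on the inactive class
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    for name, (n_act, n_inact) in DATASETS.items():
        # Hypothetical baseline: label every compound as inactive,
        # so tp = 0, fn = all actives, tn = all inactives, fp = 0.
        acc = accuracy(0, n_act, n_inact, 0)
        bacc = balanced_accuracy(0, n_act, n_inact, 0)
        ratio = imbalance_ratio(n_act, n_inact)
        print(f"{name}: ratio={ratio:.2f}, baseline ACC={acc:.3f}, BACC={bacc:.3f}")
```

Under these assumptions, a classifier that never predicts an active compound still scores a high plain accuracy on each dataset because inactives dominate, while its balanced accuracy stays at 0.5. That gap is the motivation for both the BACC metric and the undersampling of the majority class.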

Keywords: Dengue virus; Information index; Support vector machine; Undersampling; West Nile virus; Zika virus.

MeSH terms

  • Antiviral Agents / chemistry
  • Antiviral Agents / pharmacology*
  • Dengue Virus / drug effects
  • Drug Discovery / methods*
  • Flaviviridae / drug effects*
  • Flaviviridae Infections / drug therapy
  • Humans
  • Support Vector Machine*
  • West Nile virus / drug effects
  • Zika Virus / drug effects

Substances

  • Antiviral Agents