A Novel Automated Framework for QSAR Modeling of Highly Imbalanced Leishmania High-Throughput Screening Data

Omar Casanova-Alvarez; Aliuska Morales-Helguera; Miguel Ángel Cabrera-Pérez; Reinaldo Molina-Ruiz; Christophe Molina

doi:10.1021/acs.jcim.0c01439

J Chem Inf Model. 2021 Jul 26;61(7):3213-3231. doi: 10.1021/acs.jcim.0c01439. Epub 2021 Jun 30.

Authors

Omar Casanova-Alvarez¹, Aliuska Morales-Helguera², Miguel Ángel Cabrera-Pérez², Reinaldo Molina-Ruiz², Christophe Molina³

Affiliations

¹ Departamento de Química, Facultad de Química-Farmacia, Universidad Central "Marta Abreu" de Las Villas, Santa Clara, Villa Clara 54830, Cuba.
² Centro de Bioactivos Químicos, Universidad Central "Marta Abreu" de Las Villas, Santa Clara, Villa Clara 54830, Cuba.
³ PIKAÏROS S.A., B03 - 2 Allée de la Clairière, 31650 Saint Orens de Gameville, France.

PMID: 34191520
DOI: 10.1021/acs.jcim.0c01439

Abstract

Quantitative structure-activity relationship (QSAR) models for the in silico prediction of antileishmanial activity have so far been developed on small, limited datasets. The availability of large and diverse high-throughput screening data now gives the scientific community an opportunity to model this activity from chemical structure. In this study, we present the first automated KNIME workflow for modeling a large, diverse, and highly imbalanced dataset of compounds with antileishmanial activity. Because the data is strongly biased toward inactive compounds, we implemented a novel strategy based on selecting different balanced training sets and then building a consensus model that uses single decision trees as base models and three criteria for combining their outputs. The decision tree consensus was adopted after comparing its classification performance to consensuses built upon Gaussian-Naïve-Bayes, Support-Vector-Machine, Random-Forest, Gradient-Boost, and Multi-Layer-Perceptron base models. All these consensuses were rigorously validated using internal test and external validation sets and were compared against each other using Friedman and Bonferroni-Dunn statistics. For the retained decision tree-based consensus model with the lowest consensus level, which covers 100% of the chemical space of the dataset, the overall accuracy statistics were between 71 and 74% for the test set and between 71 and 76% for the external set. For a reduced chemical space (21%) and with an incremental consensus level, the accuracy statistics improved substantially, to between 86 and 92% for the test set and between 88 and 92% for the external set. These results highlight the value of the consensus model either for prioritizing a relatively small set of active compounds with high prediction sensitivity, by using the Incremental Consensus at high levels, or for predicting as many compounds as possible, by lowering the Incremental Consensus level. Finally, the developed workflow eliminates human bias, improves the reproducibility of the procedure, and allows other researchers to reproduce our design and apply it to their own QSAR problems.
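
The trade-off between consensus level and chemical-space coverage described above can be illustrated with a minimal Python sketch. It is not the authors' KNIME workflow. The abstract does not say how the balanced training sets are selected or what the three output-combination criteria are, so the sketch assumes disjoint random undersampling of the inactive class and a simple agreement-fraction rule. All function names (balanced_training_sets, consensus_predict, coverage) are hypothetical, and the base models are represented only by their 0/1 votes.

```python
import random
from typing import List, Optional, Sequence


def balanced_training_sets(actives: Sequence[str],
                           inactives: Sequence[str],
                           seed: int = 0) -> List[List[str]]:
    """Pair all actives with disjoint, equally sized chunks of inactives.

    Assumption: the minority (active) class is reused in every set, and the
    majority (inactive) class is shuffled and split without overlap.
    """
    rng = random.Random(seed)
    shuffled = list(inactives)
    rng.shuffle(shuffled)
    k = len(actives)
    n_sets = len(shuffled) // k  # leftover inactives are simply dropped
    return [list(actives) + shuffled[i * k:(i + 1) * k] for i in range(n_sets)]


def consensus_predict(votes: Sequence[int], level: float) -> Optional[int]:
    """Combine 0/1 votes from the base models (one per balanced set).

    Returns the majority label when the fraction of agreeing models reaches
    `level`; otherwise returns None, meaning the compound is left unpredicted.
    Ties are resolved in favor of the active class (an arbitrary choice here).
    """
    n_active = sum(votes)
    n_inactive = len(votes) - n_active
    if n_active >= n_inactive:
        label, agree = 1, n_active
    else:
        label, agree = 0, n_inactive
    return label if agree / len(votes) >= level else None


def coverage(all_votes: Sequence[Sequence[int]], level: float) -> float:
    """Fraction of compounds that receive a prediction at a consensus level."""
    preds = [consensus_predict(v, level) for v in all_votes]
    return sum(p is not None for p in preds) / len(preds)
```

Under these assumptions, a level of 0.5 always yields a prediction (full coverage), while raising the level leaves more compounds unpredicted and keeps only those on which the base models agree strongly. This mirrors, qualitatively, the 100% versus 21% coverage contrast reported in the abstract.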

MeSH terms

Bayes Theorem
High-Throughput Screening Assays
Humans
Leishmania*
Quantitative Structure-Activity Relationship*
Reproducibility of Results