Effect of Learning Dataset for Identification of Active Molecules: A Case Study of Integrin αIIbβ3 Inhibitors

Mol Inform. 2021 Jun;40(6):e2060040. doi: 10.1002/minf.202060040. Epub 2021 Mar 18.

Abstract

Efficient in silico approaches are needed to identify strong integrin αIIbβ3 inhibitors through a small number of measurements. To address this challenge, we investigated the effect of the learning dataset on the classification performance of machine learning models, focusing on weak and inactive compounds. The structure and activity information of the compounds were obtained from ChEMBL, and pChEMBL values were used to classify them as active, inactive, or weak. Datasets with various imbalance levels, from active:inactive = 1:1 to 1:1000, were used for the machine learning. The prediction scores of the weak samples were found to lie between those of the active and inactive compounds. In addition, another dataset consisting of 149 actives and 6.9 million inactives was screened; the results indicated that the number of positive predictions decreased for models trained with a higher number of inactives. Although there is a trade-off between false positives and false negatives, to find compounds with strong activity using a reduced number of measurements, it is better to train on a large number of inactives and to select compounds that score higher than the weak samples.
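The labeling, imbalance, and selection steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the pChEMBL cutoffs, the sampling procedure, or the exact selection rule, so the thresholds (8.0 and 5.0), the exclusion of weak compounds from training, the "score above the highest weak sample" rule, and all function names below are assumptions made for illustration only, not the authors' method.

```python
import random

# Hypothetical cutoffs: the abstract does not state the values actually used.
ACTIVE_MIN = 8.0    # assumed: pChEMBL >= 8.0 is "active"
INACTIVE_MAX = 5.0  # assumed: pChEMBL < 5.0 is "inactive"


def label_by_pchembl(pchembl, active_min=ACTIVE_MIN, inactive_max=INACTIVE_MAX):
    """Return 'active', 'weak', or 'inactive' for one pChEMBL value."""
    if pchembl >= active_min:
        return "active"
    if pchembl < inactive_max:
        return "inactive"
    return "weak"


def build_training_set(actives, inactives, ratio, seed=0):
    """Combine all actives with `ratio` inactives per active (1:ratio imbalance).

    Inactives are sampled without replacement. Weak compounds are assumed to be
    held out of training so that they can be scored afterwards.
    """
    n_inactive = min(len(inactives), ratio * len(actives))
    rng = random.Random(seed)
    sampled = rng.sample(inactives, n_inactive)
    return [(c, 1) for c in actives] + [(c, 0) for c in sampled]


def select_above_weak(scores, weak_scores):
    """Keep compounds whose score exceeds every weak-sample score.

    `scores` maps compound id -> model score. Using the maximum weak score as
    the cutoff is one possible reading of "score higher than the weak samples".
    """
    cutoff = max(weak_scores)
    return [cid for cid, s in scores.items() if s > cutoff]
```

Raising `ratio` from 1 toward 1000 reproduces the range of imbalance levels mentioned in the abstract. Any classifier that returns a continuous score could be trained on the output of `build_training_set` and then filtered with `select_above_weak`.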

Keywords: Machine learning; in-silico screening; integrin αIIbβ3.

MeSH terms

  • Platelet Glycoprotein GPIIb-IIIa Complex / antagonists & inhibitors*

Substances

  • Platelet Glycoprotein GPIIb-IIIa Complex