Effect of Learning Dataset for Identification of Active Molecules: A Case Study of Integrin αIIbβ3 Inhibitors

Kentaro Kawai; Mami Tomonou; Yume Machida; Yukiko Karuo; Atsushi Tarui; Kazuyuki Sato; Yoshiki Ikeda; Tatsuo Kinashi; Masaaki Omote

doi:10.1002/minf.202060040

Mol Inform. 2021 Jun;40(6):e2060040. doi: 10.1002/minf.202060040. Epub 2021 Mar 18.

Authors

Kentaro Kawai¹, Mami Tomonou¹, Yume Machida¹, Yukiko Karuo¹, Atsushi Tarui¹, Kazuyuki Sato¹, Yoshiki Ikeda², Tatsuo Kinashi², Masaaki Omote¹

Affiliations

¹ Faculty of Pharmaceutical Sciences, Setsunan University, 45-1, Nagaotoge-cho, Hirakata, Osaka, 573-0101, Japan.
² Department of Molecular Genetics, Institute of Biomedical Science, Kansai Medical University, 2-5-1 Shin-machi, Hirakata, Osaka, 573-1010, Japan.

PMID: 33738924
DOI: 10.1002/minf.202060040

Abstract

Efficient in silico approaches are needed to identify strong integrin αIIbβ3 inhibitors from a small number of measurements. To address this challenge, we investigated how the learning dataset affects the classification performance of machine learning models, focusing on weak and inactive compounds. Structure and activity information for the compounds was obtained from ChEMBL, and pChEMBL values were used to classify them as active, inactive, or weak. Datasets with imbalance levels ranging from active:inactive = 1:1 to 1:1000 were used for machine learning. The prediction scores of the weak samples lay between those of the active and inactive compounds. In addition, another dataset consisting of 149 actives and 6.9 million inactives was screened; the number of positive predictions decreased for models trained with more inactives. Although there is a trade-off between false positives and false negatives, to find compounds with strong activity using fewer measurements it is better to train on a large number of inactives and to select compounds that score higher than the weak samples.
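The dataset construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the pChEMBL cutoffs or the sampling procedure, so the thresholds (`ACTIVE_MIN`, `INACTIVE_MAX`), the function names, and the random subsampling below are all assumptions made for illustration, not the authors' method. Weak compounds are simply labelled here; how they were handled during training is not stated in the abstract.

```python
import random

# Hypothetical cutoffs: the abstract does not state the actual thresholds.
ACTIVE_MIN = 7.0    # assumed: pChEMBL >= 7.0 counts as active
INACTIVE_MAX = 5.0  # assumed: pChEMBL < 5.0 counts as inactive


def label_by_pchembl(pchembl):
    """Return 'active', 'weak', or 'inactive' for a pChEMBL value."""
    if pchembl >= ACTIVE_MIN:
        return "active"
    if pchembl < INACTIVE_MAX:
        return "inactive"
    return "weak"


def build_training_set(actives, inactives, ratio, seed=0):
    """Keep all actives and subsample inactives to active:inactive = 1:ratio.

    Returns a list of (compound, label) pairs with label 1 for actives
    and 0 for inactives. If too few inactives are available, all are used.
    """
    n_inactive = min(len(inactives), len(actives) * ratio)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = rng.sample(inactives, n_inactive)
    return [(c, 1) for c in actives] + [(c, 0) for c in sampled]
```

Calling `build_training_set(actives, inactives, ratio)` with `ratio` set to 1, 10, 100, or 1000 would give a series of training sets spanning the 1:1 to 1:1000 imbalance range mentioned above.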

Keywords: Machine learning; in-silico screening; integrin αIIbβ3.

MeSH terms

Platelet Glycoprotein GPIIb-IIIa Complex / antagonists & inhibitors*

Substances

Platelet Glycoprotein GPIIb-IIIa Complex