Drug Target Identification with Machine Learning: How to Choose Negative Examples

Matthieu Najm; Chloé-Agathe Azencott; Benoit Playe; Véronique Stoven

doi:10.3390/ijms22105118

Int J Mol Sci. 2021 May 12;22(10):5118. doi: 10.3390/ijms22105118.

Authors

Matthieu Najm¹,²,³, Chloé-Agathe Azencott¹,²,³, Benoit Playe¹,²,³, Véronique Stoven¹,²,³

Affiliations

¹ Center for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
² Institut Curie, 75248 Paris, France.
³ INSERM U900, 75428 Paris, France.

Abstract

Identification of the protein targets of hit molecules is essential in the drug discovery process. Target prediction with machine learning algorithms can help accelerate this search, limiting the number of required experiments. However, the drug-target interaction databases used for training present a high statistical bias, which leads to a high number of false positives and thus increases the time and cost of experimental validation campaigns. To minimize the number of false positives among predicted targets, we propose a new scheme for choosing negative examples, so that each protein and each drug appears an equal number of times in positive and negative examples. We artificially reproduce the process of target identification for three specific drugs, and more globally for 200 approved drugs. For the three detailed drug examples, and for the larger set of 200 drugs, training with the proposed scheme for the choice of negative examples improved target prediction results: the average number of false positives among the top-ranked predicted targets decreased, and overall, the rank of the true targets improved. Our method corrects the databases' statistical bias and reduces the number of false positive predictions, and therefore the number of useless experiments potentially undertaken.
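
As a rough illustration of the balancing idea described in the abstract, the sketch below picks negative (drug, protein) pairs so that each drug and each protein appears in at most as many negatives as positives, aiming for equal counts. It is not the authors' algorithm: the function name `balanced_negatives`, the greedy strategy, and the toy identifiers in the usage note are assumptions made for illustration, and this greedy pass does not guarantee exact balance for every molecule or protein.

```python
import random
from collections import Counter


def balanced_negatives(positives, seed=0):
    """Greedy sketch (hypothetical helper, not the published method).

    Picks unlabeled (drug, protein) pairs as negatives so that no drug and no
    protein appears in more negatives than positives, aiming for equal counts.
    """
    rng = random.Random(seed)
    pairs = sorted(set(positives))                 # sorted for reproducibility
    positive_set = set(pairs)
    drug_need = Counter(d for d, _ in pairs)       # negatives still owed per drug
    protein_need = Counter(p for _, p in pairs)    # negatives to pick per protein
    negatives = []
    # Proteins with the most positives are the hardest to balance: do them first.
    for protein, need in protein_need.most_common():
        candidates = [
            d for d in drug_need
            if drug_need[d] > 0 and (d, protein) not in positive_set
        ]
        rng.shuffle(candidates)                    # random tie-breaking
        # Stable sort: drugs that still owe the most negatives come first.
        candidates.sort(key=lambda d: drug_need[d], reverse=True)
        for drug in candidates[:need]:
            negatives.append((drug, protein))
            drug_need[drug] -= 1
    return negatives
```

For example, calling `balanced_negatives([("d1", "p1"), ("d1", "p2"), ("d2", "p3")])` on these made-up identifiers returns pairs that are absent from the positive list, with `d1` used at most twice and `d2` at most once. Treating unlabeled pairs as negatives is itself an assumption, since an unlabeled pair may be an interaction that has not been observed yet.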

Keywords: chemogenomic; drug discovery; false positive predictions; learning bias; machine learning; negative examples; random forests; support vector machines; target identification.

MeSH terms

Computational Biology / methods*
Drug Discovery / methods*
Humans
Machine Learning*
Pharmaceutical Preparations / chemistry*
Pharmaceutical Preparations / metabolism
Protein Interaction Mapping
Proteins / chemistry*
Proteins / metabolism
Software*
Support Vector Machine

Substances

Pharmaceutical Preparations
Proteins

Grants and funding

RF20190502488/Vaincre la Mucoviscidose