Predicting the cytotoxicity of chemicals using ensemble learning methods and molecular fingerprints

Zimo Yin; Haixin Ai; Li Zhang; Guofei Ren; Yuming Wang; Qi Zhao; Hongsheng Liu

doi:10.1002/jat.3785

J Appl Toxicol. 2019 Oct;39(10):1366-1377. doi: 10.1002/jat.3785. Epub 2019 Feb 14.

Authors

Zimo Yin¹, Haixin Ai² ³ ⁴, Li Zhang² ³ ⁴, Guofei Ren¹, Yuming Wang⁵, Qi Zhao⁶, Hongsheng Liu² ³ ⁴

Affiliations

¹ School of Information, Liaoning University, Shenyang, 110036, China.
² School of Life Science, Liaoning University, Shenyang, 110036, China.
³ Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province, Shenyang, 110036, China.
⁴ Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China.
⁵ Department of Breast Surgery, The First Hospital of China Medical University, Shenyang, Liaoning, 110001, China.
⁶ School of Mathematics, Liaoning University, Shenyang, 110036, China.

PMID: 30763981
DOI: 10.1002/jat.3785

Abstract

The prediction of compound cytotoxicity is an important part of the drug discovery process. However, predictive performance is usually poor because the high-throughput datasets involved suffer from class imbalance. In this study, several strategies for performing a structure-activity relationship study of a cytotoxic endpoint in the AID364 dataset were explored to address the class-imbalance problem. Random forest adaboost was used as the base learners for 10 types of molecular fingerprints, and an ensemble method and six data-balancing methods were applied to balance the classes. The ensemble model using the MACCS fingerprint performed best, giving an area under the curve (AUC) of 85.2% ± 0.35%, sensitivity of 81.8% ± 0.65%, and specificity of 76.0% ± 0.12% in fivefold cross-validation, and an AUC of 78.8%, sensitivity of 55.5%, and specificity of 78.5% in external validation. The models also performed well on other datasets with different sizes and degrees of imbalance. To explore the structural commonality of cytotoxic compounds, several substructures were identified that can serve as an important reference for substructure alerts. These results indicate that the proposed models are helpful in predicting the cytotoxicity of chemicals.

Keywords: class-imbalance problem; cytotoxicity; ensemble learning method; molecular fingerprint; substructure alerts.
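
The abstract reports sensitivity, specificity, and area under the curve, and it mentions data balancing without naming the six methods used. The following is a minimal Python sketch of how such metrics are commonly computed on an imbalanced binary dataset. It is not the authors' code: the function names, the toy labels and scores, the 0.5 threshold, and the use of random undersampling as the balancing step are all assumptions made for illustration.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 = cytotoxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def random_undersample(y_true, seed=0):
    """Indices of a balanced subset: every minority sample plus an
    equal-sized random draw from the majority class (an assumed balancing method)."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y_true) if t == 1]
    neg = [i for i, t in enumerate(y_true) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


# Hypothetical toy data: 2 cytotoxic compounds among 8, with model scores.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
balanced_idx = random_undersample(y_true)
```

On this toy data one of the two cytotoxic compounds falls below the threshold, so sensitivity is much lower than specificity even though the ranking-based AUC is high. This is the kind of gap that class imbalance tends to produce and that balancing methods aim to reduce.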

Publication types

Research Support, Non-U.S. Gov't
Review

MeSH terms

Algorithms
Carcinogens / classification*
Carcinogens / toxicity*
Drug Discovery / classification*
Drug Discovery / methods*
Humans
Machine Learning*
Quantitative Structure-Activity Relationship*

Substances

Carcinogens
