Prediction of KRASG12C inhibitors using conjoint fingerprint and machine learning-based QSAR models

J Mol Graph Model. 2023 Jul:122:108466. doi: 10.1016/j.jmgm.2023.108466. Epub 2023 Apr 7.

Abstract

Kirsten rat sarcoma virus G12C (KRASG12C) is the major protein mutation associated with non-small cell lung cancer (NSCLC) severity. Inhibiting KRASG12C is therefore one of the key therapeutic strategies for NSCLC patients. In this paper, a cost-effective, data-driven drug design approach employing machine learning-based quantitative structure-activity relationship (QSAR) analysis was built for predicting ligand affinities against the KRASG12C protein. A curated and non-redundant dataset of 1033 compounds with KRASG12C inhibitory activity (pIC50) was used to build and test the models. The PubChem fingerprint, Substructure fingerprint, Substructure fingerprint count, and the conjoint fingerprint (a combination of PubChem fingerprint and Substructure fingerprint count) were used to train the models. Using comprehensive validation methods and various machine learning algorithms, the results showed that XGBoost regression (XGBoost) achieved the highest performance in terms of goodness of fit, predictivity, generalizability and model robustness (R2 = 0.81, Q2CV = 0.60, Q2Ext = 0.62, R2 - Q2Ext = 0.19, R2Y-Random = 0.31 ± 0.03, Q2Y-Random = -0.09 ± 0.04). The top 13 molecular fingerprints that correlated with the predicted pIC50 values were SubFPC274 (aromatic atoms), SubFPC307 (number of chiral centers), PubChemFP37 (≥1 chlorine), SubFPC18 (number of alkylarylethers), SubFPC1 (number of primary carbons), SubFPC300 (number of 1,3-tautomerizables), PubChemFP621 (N-C:C:C:N structure), PubChemFP23 (≥1 fluorine), SubFPC2 (number of secondary carbons), SubFPC295 (number of C-ONS bonds), PubChemFP199 (≥4 6-membered rings), PubChemFP180 (≥1 nitrogen-containing 6-membered ring), and SubFPC180 (number of tertiary amines). These molecular fingerprints were visualized and validated using molecular docking experiments.
In conclusion, this conjoint fingerprint and XGBoost-QSAR model proved useful as a high-throughput screening tool for KRASG12C inhibitor identification and drug design.
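
As an illustration only, the conjoint fingerprint and the R2 - Q2Ext gap described in the abstract can be sketched in plain Python. The abstract does not state how the fingerprints were generated or which formula was used for Q2Ext, so everything here is an assumption: the helper names (conjoint_fingerprint, r_squared), the toy vectors, and the choice to reference the external-set sum of squares to the training-set mean. Only the values 0.81, 0.62 and 0.19 come from the abstract.

```python
from statistics import mean


def conjoint_fingerprint(pubchem_bits, substructure_counts):
    """Concatenate binary PubChem bits with Substructure counts (hypothetical helper)."""
    return list(pubchem_bits) + list(substructure_counts)


def r_squared(y_true, y_pred, y_ref_mean=None):
    """Return 1 - SS_res / SS_tot.

    If y_ref_mean is given, SS_tot is measured around that value instead of
    the mean of y_true (assumed here for the external-set Q2).
    """
    ref = mean(y_true) if y_ref_mean is None else y_ref_mean
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - ref) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy feature vectors, not real fingerprints: 4 presence bits and 3 counts.
pubchem = [1, 0, 1, 1]
subfpc = [6, 0, 2]
features = conjoint_fingerprint(pubchem, subfpc)

# Toy pIC50 values and predictions, invented for the example.
y_train = [5.0, 6.0, 7.0, 8.0]
y_train_fit = [5.2, 5.9, 7.1, 7.8]
y_ext = [6.0, 7.0]
y_ext_pred = [6.5, 6.6]

r2 = r_squared(y_train, y_train_fit)
q2_ext = r_squared(y_ext, y_ext_pred, y_ref_mean=mean(y_train))
gap = r2 - q2_ext  # a large gap suggests overfitting to the training set

# The gap reported in the abstract, recomputed from its own numbers.
reported_gap = 0.81 - 0.62

print(features, round(r2, 2), round(q2_ext, 2), round(gap, 2), round(reported_gap, 2))
```

In this toy case the gap is deliberately large, in contrast to the 0.19 reported for the XGBoost model; in practice the feature vectors would feed a regressor rather than be inspected directly.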

Keywords: Deep neural network; Drug design; KRAS; Machine learning; QSAR; Random forest; Support vector regression; XGBoost.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Carcinoma, Non-Small-Cell Lung* / drug therapy
  • Humans
  • Lung Neoplasms* / drug therapy
  • Lung Neoplasms* / genetics
  • Machine Learning
  • Molecular Docking Simulation
  • Mutation
  • Proto-Oncogene Proteins p21(ras)
  • Quantitative Structure-Activity Relationship

Substances

  • Proto-Oncogene Proteins p21(ras)
  • KRAS protein, human