Protein pK_a Prediction by Tree-Based Machine Learning

Ada Y Chen; Juyong Lee; Ana Damjanovic; Bernard R Brooks

doi:10.1021/acs.jctc.1c01257


J Chem Theory Comput. 2022 Apr 12;18(4):2673-2686. doi: 10.1021/acs.jctc.1c01257. Epub 2022 Mar 15.

Authors

Ada Y Chen¹ ², Juyong Lee³, Ana Damjanovic⁴, Bernard R Brooks²

Affiliations

¹ Department of Physics & Astronomy, Johns Hopkins University, Baltimore, Maryland 21218, United States.
² Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, United States.
³ Department of Chemistry, Division of Chemistry and Biochemistry, Kangwon National University, 1 Gangwondaehak-gil, Chuncheon 24341, Republic of Korea.
⁴ Department of Biophysics, Johns Hopkins University, Baltimore, Maryland 21218, United States.

Abstract

Protonation states of ionizable protein residues modulate many essential biological processes. Modeling and understanding these processes correctly requires accurate pK_a values for those residues. Here, we present four tree-based machine learning models for protein pK_a prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pK_a datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pK_a prediction tool PROPKA and 15% better than the published result from the pK_a prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values of 0.56 and 0.78, respectively, when considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pK_a predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pK_a values close to physiological pH.
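The RMSE figures quoted above (overall, surface, and buried) and the percentage comparisons against other tools can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names, the toy records, and the reading of "X% better" as a relative reduction in RMSE are all assumptions made for illustration; the numbers in the toy data are invented and do not come from the paper.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental pK_a values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_by_group(records):
    """RMSE per group for (group, predicted, experimental) records.

    The group label (e.g. "surface" or "buried") is a hypothetical field;
    how residues are assigned to groups is outside this sketch.
    """
    grouped = {}
    for group, pred, exp in records:
        preds, exps = grouped.setdefault(group, ([], []))
        preds.append(pred)
        exps.append(exp)
    return {group: rmse(p, e) for group, (p, e) in grouped.items()}


def relative_improvement(baseline_rmse, model_rmse):
    """Fractional RMSE reduction relative to a baseline (assumed metric)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Invented toy data: (group, predicted pK_a, experimental pK_a).
records = [
    ("surface", 4.0, 3.5),
    ("surface", 6.5, 6.0),
    ("buried", 8.0, 7.0),
    ("buried", 5.0, 6.0),
]

per_group = rmse_by_group(records)
overall = rmse([r[1] for r in records], [r[2] for r in records])

print(per_group)
print(round(overall, 3))
print(relative_improvement(1.0, 0.5))
```

Reporting RMSE separately for surface and buried residues, as the abstract does, shows whether a low overall error hides poorer accuracy on the harder buried sites.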

MeSH terms

Algorithms
Humans
Kinetics
Machine Learning*
Proteins* / chemistry

Substances

Proteins

Grants and funding

ZIA HL001051/ImNIH/Intramural NIH HHS/United States