MMPatho: Leveraging Multilevel Consensus and Evolutionary Information for Enhanced Missense Mutation Pathogenic Prediction

Fang Ge; Muhammad Arif; Zihao Yan; Hanin Alahmadi; Apilak Worachartcheewan; Dong-Jun Yu; Watshara Shoombuatong

doi:10.1021/acs.jcim.3c00950

J Chem Inf Model. 2023 Nov 27;63(22):7239-7257. doi: 10.1021/acs.jcim.3c00950. Epub 2023 Nov 10.

Authors

Fang Ge¹ ², Muhammad Arif³ ⁴, Zihao Yan⁵, Hanin Alahmadi⁶, Apilak Worachartcheewan⁴, Dong-Jun Yu⁵, Watshara Shoombuatong²

Affiliations

¹ School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, 9 Wenyuanlu, Nanjing 210023, China.
² Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
³ College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
⁴ Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
⁵ School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China.
⁶ College of Computer Science and Engineering, Taibah University, Madinah 344, Saudi Arabia.

Abstract

Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing missense mutation pathogenic prediction. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic gain-of-function/loss-of-function (GOF/LOF) MMs. Based on this data set, for each mutation, we used Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individuals' outputs, and genome-level features. Additionally, protein sequences were generated from ENSP identifiers with the Ensembl API and then encoded, and the ESM-1b and ProtTrans-T5 embeddings at the mutant sites were extracted. Building on these efforts, we developed our model group (MMPatho), which comprises ConsMM and EvoIndMM. Specifically, ConsMM employs individuals' outputs and XGBoost with SHAP explanation analysis, while EvoIndMM investigates whether predictive capability can be enhanced by incorporating evolutionary information from the embeddings of two large protein language models, ESM-1b and ProtT5-XL-U50. In rigorous comparative experiments, ConsMM and EvoIndMM achieved AUROC values of 0.9836 and 0.9854 and AUPR values of 0.9852 and 0.9902, respectively, on the blind test set, which shares no variations or proteins with the training data, thus highlighting the superiority of our computational approach in the prediction of MM pathogenicity. Our web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity (alongside the reliability index score) of MMs using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our web server.
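
The mutant-site embedding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mutation notation (`R3H`), the function names `parse_mutation` and `mutant_site_features`, and the plain concatenation of the site embedding with individual predictor scores are all assumptions made for illustration, and toy 2-dimensional vectors stand in for real ESM-1b or ProtT5-XL-U50 embeddings.

```python
import re
from typing import List, Sequence, Tuple

# Assumed notation (not taken from the paper): wild-type residue,
# 1-based position, mutant residue, for example "R3H".
_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_MUTATION_RE = re.compile(rf"^([{_AMINO_ACIDS}])(\d+)([{_AMINO_ACIDS}])$")


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split a string such as 'R3H' into (wild_type, position, mutant)."""
    match = _MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"not a missense mutation string: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if wild_type == mutant:
        raise ValueError("wild-type and mutant residues are identical")
    return wild_type, position, mutant


def mutant_site_features(
    sequence: str,
    mutation: str,
    per_residue_embeddings: Sequence[Sequence[float]],
    predictor_scores: Sequence[float],
) -> List[float]:
    """Return the embedding row at the mutated position followed by the
    scores of individual predictors (hypothetical feature layout)."""
    wild_type, position, _ = parse_mutation(mutation)
    if len(per_residue_embeddings) != len(sequence):
        raise ValueError("expected one embedding vector per residue")
    if not 1 <= position <= len(sequence):
        raise ValueError("mutation position lies outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError("wild-type residue does not match the sequence")
    # Positions are 1-based in the mutation string, 0-based in the list.
    site_embedding = per_residue_embeddings[position - 1]
    return list(site_embedding) + list(predictor_scores)


# Toy example: a 5-residue sequence with 2-dimensional embeddings and
# two made-up predictor scores.
toy_sequence = "MKRLV"
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
toy_features = mutant_site_features(toy_sequence, "R3H", toy_embeddings, [0.91, 0.78])
```

Checking the wild-type residue against the sequence before indexing guards against a mismatch between the variant annotation and the protein sequence retrieved for it, which would otherwise silently select the wrong embedding row.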

MeSH terms

Computational Biology*
Consensus
Humans
Mutation, Missense*
Proteins
Reproducibility of Results

Substances

Proteins