Predicting protein phosphorylation sites in soybean using interpretable deep tabular learning network

Elham Khalili; Shahin Ramazi; Faezeh Ghanati; Samaneh Kouchaki

doi:10.1093/bib/bbac015

Brief Bioinform. 2022 Mar 10;23(2):bbac015. doi: 10.1093/bib/bbac015.

Authors

Elham Khalili¹, Shahin Ramazi², Faezeh Ghanati¹, Samaneh Kouchaki³

Affiliations

¹ Department of Plant Science, Faculty of Science, Tarbiat Modares University, Tehran, Iran.
² Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran.
³ Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, UK.

PMID: 35152280
DOI: 10.1093/bib/bbac015

Abstract

Phosphorylation of proteins is one of the most significant post-translational modifications (PTMs) and plays a crucial role in plant functionality due to its impact on signaling, gene expression, enzyme kinetics, protein stability and interactions. Accurate prediction of plant phosphorylation sites (p-sites) is vital, as abnormal regulation of phosphorylation usually leads to plant diseases. However, current experimental methods for PTM prediction suffer from high computational cost and are error-prone. The present study develops machine learning-based prediction techniques, including a high-performance interpretable deep tabular learning network (TabNet), to improve the prediction of protein p-sites in soybean. Moreover, we use a hybrid feature set of sequence-based features, physicochemical properties and position-specific scoring matrices to predict serine (Ser/S), threonine (Thr/T) and tyrosine (Tyr/Y) p-sites in soybean for the first time. The experimentally verified p-site data of soybean proteins are collected from the Eukaryotic Phosphorylation Sites Database (EPSD) and the database of post-translational modifications (dbPTM). We then remove redundancy from the positive and negative samples by dropping protein sequences with >40% similarity. All of the developed techniques achieve an accuracy above 70%. The results demonstrate that the TabNet model is the best-performing classifier when using hybrid features with a window size of 13, yielding 78.96% sensitivity and 77.24% specificity. The results indicate that the TabNet method has advantages in terms of both performance and interpretability. The proposed technique can automatically analyze the data without measurement errors or human intervention. Furthermore, it can be used to predict putative protein p-sites in plants effectively. The collected dataset and source code are publicly deposited at https://github.com/Elham-khalili/Soybean-P-sites-Prediction.
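
The windowing and evaluation steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code (see the repository above for that): the function names, the use of "X" to pad windows near the sequence ends, and the 0/1 label encoding are all assumptions made for illustration. The sketch shows how a fixed-length window (13 residues by default) can be centered on each candidate S, T or Y residue, and how sensitivity and specificity are computed from binary predictions.

```python
from typing import List, Sequence, Tuple

# Candidate phosphorylatable residues: serine, threonine, tyrosine.
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def extract_windows(
    sequence: str, window_size: int = 13, pad: str = "X"
) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every S/T/Y in the sequence.

    The padding character is an assumption; the abstract does not say
    how windows near the termini are handled.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site is centered")
    flank = window_size // 2
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in PHOSPHO_RESIDUES:
            # Residue i sits at padded index i + flank, so this slice
            # places it exactly in the middle of the window.
            windows.append((i + 1, padded[i : i + window_size]))
    return windows


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy sequence, not taken from the soybean dataset.
    for position, window in extract_windows("MKSSTPYAAGLS"):
        print(position, window)
```

Each window would then be encoded into features (for example sequence-based, physicochemical and PSSM-derived values, as the abstract describes) before being passed to a classifier; that encoding step is outside the scope of this sketch.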

Keywords: computational prediction; interpretable deep tabular learning network (TabNet); machine learning; protein phosphorylation; soybean.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Sequence
Computational Biology / methods
Glycine max* / genetics
Humans
Machine Learning
Phosphorylation
Protein Processing, Post-Translational*