Identification of phosphorylation site using S-padding strategy based convolutional neural network

Health Inf Sci Syst. 2022 Sep 17;10(1):29. doi: 10.1007/s13755-022-00196-6. eCollection 2022 Dec.

Abstract

Purpose: Abnormal phosphorylation has been shown to be associated with a variety of human diseases, and the identification of phosphorylation sites is an active research topic in healthcare. Deep learning studies of phosphorylation site prediction often incorporate many kinds of auxiliary information, and their reliance on complex models limits the scenarios in which those models can be used.

Methods: This paper proposes an enhanced deep learning method that combines an S-padding strategy with a convolutional neural network (CNN). The S-padding strategy builds a three-dimensional matrix containing extension information derived from the original amino acid sequences, and a corresponding 2D-CNN model is designed to extract comprehensive features of the region around each phosphorylation site in a protein sequence.
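The encoding step described above can be illustrated with a generic sketch. It does not reproduce the S-padding strategy, whose details are not given in this abstract; it only shows one common way to turn the residues around a candidate site into a three-dimensional array that a 2D convolution could consume. The function names, the `-` placeholder symbol, and the flank size are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PAD = "-"                              # hypothetical placeholder beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def extract_window(sequence: str, site: int, flank: int) -> str:
    """Return the residues within `flank` positions of `site` (0-based).

    Positions that fall outside the sequence are filled with PAD so that
    every window has length 2 * flank + 1 with the site at its center.
    """
    left = max(0, site - flank)
    right = min(len(sequence), site + flank + 1)
    pad_left = PAD * (flank - (site - left))
    pad_right = PAD * (flank - (right - 1 - site))
    return pad_left + sequence[left:right] + pad_right


def one_hot(window: str) -> np.ndarray:
    """Encode a window as a (window length, alphabet size) binary matrix.

    Raises KeyError for symbols outside ALPHABET (e.g. ambiguous residues).
    """
    encoded = np.zeros((len(window), len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(window):
        encoded[pos, INDEX[aa]] = 1.0
    return encoded


def encode_site(sequence: str, site: int, flank: int = 3) -> np.ndarray:
    """Return an array of shape (1, window length, alphabet size).

    The leading axis is a single channel, the layout a 2D convolution expects.
    """
    return one_hot(extract_window(sequence, site, flank))[np.newaxis, :, :]


# Toy sequence (made up for illustration); the serine at index 2 is the candidate site.
tensor = encode_site("MKSPAT", site=2, flank=3)
print(tensor.shape)  # (1, 7, 21)
```

In this sketch the only extra information is the end-of-sequence placeholder; a scheme that adds further extension information would need more channels or a larger matrix.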

Results: In fivefold cross-validation experiments on a human dataset, the proposed method achieves an accuracy of 89.68% on serine/threonine sites and 88.16% on tyrosine sites. Furthermore, phosphorylation site prediction on different organisms yields accuracy, sensitivity, and specificity all above 0.85, indicating the method's potential for the phosphorylation site prediction task. Comparison with existing models shows that the proposed method achieves higher accuracy and AUC, and that its performance can improve further with sufficient training data.
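For readers unfamiliar with the evaluation terms above, the following minimal sketch shows how accuracy, sensitivity, and specificity follow from binary predictions, and how a dataset can be partitioned into five folds. It uses the standard definitions and is not the authors' evaluation code; the function names and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Undefined ratios (no samples of that class) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # fraction of true sites recovered
        "specificity": ratio(tn, tn + fp),  # fraction of non-sites rejected
    }


def kfold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k folds.

    Samples are assigned to folds round-robin without shuffling; a real
    experiment would normally shuffle and stratify by class first.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Toy labels (made up): 1 = phosphorylation site, 0 = non-site.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

In a fivefold experiment, a model would be trained on each `train` split, scored on the matching `test` split with `binary_metrics`, and the five scores averaged.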

Conclusion: Trained on a large amount of phosphorylation data, this method enables proteome-wide prediction, further exploits the potential of protein phosphorylation site identification, and helps provide insight into phosphorylation mechanisms.

Keywords: Convolutional neural network; Feature extension; Protein phosphorylation site identification; S-padding strategy.