T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors

Yueming Hu; Yejun Wang; Xiaotian Hu; Haoyu Chao; Sida Li; Qinyang Ni; Yanyan Zhu; Yixue Hu; Ziyi Zhao; Ming Chen

doi:10.1016/j.csbj.2024.01.015

Comput Struct Biotechnol J. 2024 Jan 23;23:801-812. doi: 10.1016/j.csbj.2024.01.015. eCollection 2024 Dec.

Authors

Yueming Hu¹, Yejun Wang²,³, Xiaotian Hu¹, Haoyu Chao¹, Sida Li¹, Qinyang Ni¹, Yanyan Zhu¹, Yixue Hu², Ziyi Zhao², Ming Chen¹,⁴

Affiliations

¹ Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China.
² Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China.
³ Department of Cell Biology and Genetics, College of Basic Medicine, Shenzhen University Medical School, Shenzhen, China.
⁴ Institute of Hematology, Zhejiang University School of Medicine, The First Affiliated Hospital, Zhejiang University, Hangzhou 310058, China.

Abstract

Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing diseases. The identification of effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but this remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers predicting T4SEs and compared their performance. We integrated three modules into a model called T4SEpp. The first module searched for full-length homologs of known T4SEs, signal sequences, and effector domains; the second module fine-tuned a machine learning model using data for a signal sequence feature; and the third module used the three best-performing pre-trained protein language models. T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ∼0.98 accuracy at a high specificity of ∼0.99, based on the assessment of an independent validation dataset. T4SEpp predicted 13 T4SEs from Helicobacter pylori, including the well-known CagA and 12 other potential ones, among which eleven could potentially interact with human proteins. This suggests that these potential T4SEs may be associated with the pathogenicity of H. pylori. Overall, T4SEpp provides a better solution to assist in the identification of bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
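As an illustration of the quantities mentioned in the abstract, the sketch below shows how per-module scores might be combined and how accuracy and specificity are computed from binary predictions. The abstract does not state how T4SEpp integrates its three modules, so the weighted average, the function names, and the example values are all assumptions made for illustration only; the accuracy and specificity formulas are the standard definitions.

```python
from typing import Sequence, Tuple


def integrate_scores(module_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-module probabilities with a weighted average.

    ASSUMPTION: a weighted average is a hypothetical integration rule,
    not the rule documented for T4SEpp.
    """
    if len(module_scores) != len(weights):
        raise ValueError("one weight per module score is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(module_scores, weights)) / total


def accuracy_and_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (accuracy, specificity) for binary labels (1 = effector, 0 = non-effector)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Accuracy: fraction of all proteins classified correctly.
    accuracy = (tp + tn) / len(y_true)
    # Specificity: fraction of true non-effectors predicted as non-effectors.
    negatives = tn + fp
    specificity = tn / negatives if negatives else float("nan")
    return accuracy, specificity


if __name__ == "__main__":
    # Hypothetical scores from three modules for one protein, equally weighted.
    print(integrate_scores([0.9, 0.5, 0.7], [1.0, 1.0, 1.0]))
    # Toy labels, not data from the study.
    print(accuracy_and_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

High specificity matters in effector screening because non-effectors vastly outnumber effectors in a bacterial proteome, so even a small false-positive rate can swamp the true candidates.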

Keywords: Deep learning; Helicobacter pylori T4SEs; Protein language model; T4SE Prediction; T4SEpp; T4SS.