Investigating the Precise Identification of Citrullination Sites with High-Performance Score Metrics using a Powerful Computation Predicting Tool

Comb Chem High Throughput Screen. 2023 Sep 12. doi: 10.2174/1386207326666230912151932. Online ahead of print.

Abstract

Background: Predicting protein citrullination sites (PCSs) is essential for elucidating the molecular mechanisms of citrullination and for designing drugs applicable to major human diseases. Identifying PCSs experimentally is time-consuming and costly. However, current PCS predictors are limited in scope; in particular, most of the commonly used predictors achieve limited performance scores.

Objective: This work aims to provide an improved predictor of citrullination sites, built on a benchmark dataset using a machine learning platform.

Methods: This study presents a reliable citrullination site predictor based on a benchmark dataset containing a 1:1 ratio of positive and negative samples. We classified citrullination sites using Composition of K-Spaced Amino Acid Pairs (CKSAAP) feature encoding and a Support Vector Machine (SVM) classifier.
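
As an illustration of the CKSAAP encoding named in the Methods, the following is a minimal sketch, not the authors' implementation. It assumes the 20 standard amino acids, a hypothetical default of k = 0..2, and normalization by the number of valid pairs at each spacing; the abstract does not state the window length, the range of k, or the normalization used. The function name `cksaap` and the example peptide are illustrative only.

```python
from itertools import product

# Assumption: the 20 standard amino acids; other residues are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(peptide, k_max=2):
    """Return a CKSAAP feature vector of length 400 * (k_max + 1).

    For each spacing k in 0..k_max, count every residue pair
    (peptide[i], peptide[i + k + 1]) and divide by the number of
    valid pairs found at that spacing.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(peptide) - k - 1):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:  # ignore pairs with non-standard residues
                counts[pair] += 1
                total += 1
        # Frequencies for this spacing, in the fixed PAIRS order.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


# Hypothetical peptide window centred on an arginine (R).
vector = cksaap("ARGRA", k_max=1)
print(len(vector))  # 400 features per spacing, two spacings
```

A vector of this kind would then be passed to a classifier such as an SVM; the kernel and hyperparameters are not given in the abstract and are left out here.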

Results: We developed PCS predictors using integrated machine-learning methods that produced the highest average scores. Using 10-fold cross-validation on test datasets, the True Positive Rate (TPR) was 98.34%, the True Negative Rate (TNR) was 99.44%, the accuracy was 98.89%, the Matthews Correlation Coefficient (MCC) was 98.21%, the Area Under the ROC Curve (AUC) was 0.999, and the partial Area Under the ROC Curve (pAUC) was 0.1968.
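
For reference, the threshold-based metrics listed in the Results follow from the confusion-matrix counts by their standard definitions. The sketch below shows those definitions only; the counts in the example are hypothetical and are not taken from the study. Note that it returns MCC on its conventional scale of -1 to 1, whereas the abstract reports it as a percentage.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute TPR, TNR, accuracy and MCC from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity: correctly predicted positives
    tnr = tn / (tn + fp)  # specificity: correctly predicted negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy, "MCC": mcc}


# Hypothetical counts for a balanced (1:1) test fold of 200 samples.
scores = binary_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```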

Conclusion: Based on overall performance, our predictor performs significantly better than current tools on the same benchmark dataset. Moreover, it showed better performance metrics on both test and training datasets. Our predictor is promising and can be used as a complementary technique for fast and precise identification of citrullination sites.

Keywords: Citrullination site; Features encoding; Machine learning techniques; Post-translational modifications; Support Vector Machine.