RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition

Ting Fang; Zizheng Zhang; Rui Sun; Lin Zhu; Jingjing He; Bei Huang; Yi Xiong; Xiaolei Zhu

doi:10.1016/j.omtn.2019.10.008

Mol Ther Nucleic Acids. 2019 Dec 6;18:739-747. doi: 10.1016/j.omtn.2019.10.008. Epub 2019 Oct 18.

Authors

Ting Fang¹, Zizheng Zhang², Rui Sun³, Lin Zhu⁴, Jingjing He², Bei Huang⁵, Yi Xiong⁶, Xiaolei Zhu⁷

Affiliations

¹ School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China.
² School of Life Sciences, Anhui University, Hefei, Anhui 230601, China.
³ Beijing Baidu Netcom Sciences and Technology Co., Ltd., Beijing, China.
⁴ School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China.
⁵ School of Life Sciences, Anhui University, Hefei, Anhui 230601, China. Electronic address: beihuang@163.com.
⁶ State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China. Electronic address: xiongyi@sjtu.edu.cn.
⁷ School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China; School of Life Sciences, Anhui University, Hefei, Anhui 230601, China. Electronic address: xlzhu_mdl@hotmail.com.

Abstract

5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies have shown that m5C plays important roles in many biological processes, such as RNA metabolism and cell fate decisions. Because most experimental methods for determining m5C sites across the transcriptome are time-consuming and expensive, accurate computational methods for identifying m5C sites are urgently needed. A benchmark dataset is important for developing and evaluating such methods. In this work, we constructed four different datasets that differ in data redundancy and class imbalance. From these datasets, we generated three kinds of features, namely K-nucleotide frequencies (KNFs), K-spaced nucleotide pair frequencies (KSNPFs), and pseudo-dinucleotide composition (pseDNC), and then used a support vector machine (SVM) to build our models. Using the imbalanced, nonredundant dataset Met935, we extensively studied the three kinds of features and determined an optimal feature combination. With this combination, we built models on three different datasets and compared them with state-of-the-art models. According to the results of the stringent jackknife test, the models based on the three features 4NF, 1SNPF, and pseDNC are superior or comparable to other methods. To choose between the model trained on the imbalanced dataset Met935 and the model trained on the balanced dataset Met240, we further evaluated both on an independent test set, Test1157. The model based on the balanced dataset Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC) (0.154). On the independent test set, this model is also superior to other state-of-the-art methods as measured by MCC, an integrated performance measure. We therefore selected the model based on Met240 as our final model and named it RNAm5CPred. A web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) is provided to facilitate experimental research.
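
To make the composition features and the MCC measure named above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an RNA alphabet of A, C, G, and U, assumes each count is normalized by the number of windows in the sequence, and uses the standard confusion-matrix definition of MCC. The function names knf, ksnpf, and mcc are hypothetical, and the normalization used by RNAm5CPred itself may differ. The pseDNC features and the SVM training step are not shown.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # assumption: sequences use the RNA alphabet


def knf(seq, k):
    """K-nucleotide frequencies: one value per k-mer (4**k values)."""
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with other characters are skipped
            counts[kmer] += 1
    # Assumption: normalize by the number of windows.
    return [counts[m] / windows for m in kmers]


def ksnpf(seq, k):
    """K-spaced nucleotide pair frequencies: 16 values for one gap k.

    A pair is formed by two nucleotides separated by k positions,
    so k = 1 pairs seq[i] with seq[i + 2].
    """
    windows = len(seq) - k - 1
    if windows <= 0:
        raise ValueError("sequence is too short for this gap")
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / windows for p in pairs]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    example = "ACGUACGUACGUACGUACGUA"  # made-up sequence, not from the paper
    features = knf(example, 4) + ksnpf(example, 1)
    print(len(features))  # 256 values for 4NF plus 16 for 1SNPF
```

Under these assumptions, 4NF contributes 256 values and 1SNPF contributes 16, so the concatenated vector (before adding pseDNC) has 272 entries per sequence. Because MCC combines all four confusion-matrix counts, it is less sensitive to class imbalance than accuracy, which is one reason it suits the comparison on an independent test set.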

Keywords: 5-methylcytosine site; nucleotide composition; post-transcriptional modification; prediction; support vector machine.