OriC-ENS: A sequence-based ensemble classifier for predicting origin of replication in S. cerevisiae

Comput Biol Chem. 2021 Jun:92:107502. doi: 10.1016/j.compbiolchem.2021.107502. Epub 2021 Apr 26.

Abstract

DNA Replication plays the most crucial part in biological inheritance, ensuring an even flow of genetic information from parent to offspring. The beginning site of DNA Replication which is called the Origin of Replication (ORI), plays a significant role in understanding the molecular mechanisms and genomic analysis of DNA. Hence, it is paramount to accurately identify the origin of replication to gain a more accurate understanding of the biochemical and genomic properties of DNA. In this paper, We have proposed a new approach named OriC-ENS that uses sequence-based feature extraction techniques, K-mer, K-gapped Mono-Di, and Di Mono, and an ensemble classification technique that uses majority voting for the identification of Origin of Replication. We have used three SVM classifiers, one for the K-mer features and two more for K-Gapped Mono-Di and K-Gapped Di-mono features. Finally, we used majority voting to combine the prediction by each predictor. Experimental results on the S. Cerevisiae dataset have shown that our method achieves an accuracy of 91.62 % which outperforms other state-of-the-art methods by a significant margin. We have also tested our method using other evaluation metrics such as Matthews Correlation Coefficient (MCC), Area Under Curve(AUC), Sensitivity, and Specificity, where it has achieved a score of 0.83, 0.98, 0.90, and 0.92 respectively. We have further evaluated our model on an independent test set collected from OriDB, consisting of the sequences of Schizosaccharomyces pombe where we have seen that our model can predict the origin of replication efficiently and with great precision. We have made our python-based source code available at https://github.com/MehediAzim/OriC-ENS.

Keywords: Classification; Origin of replication; Prediction tool; Sequence based features.

MeSH terms

  • DNA Replication / genetics
  • Databases, Genetic
  • Genes, Fungal / genetics*
  • Saccharomyces cerevisiae / genetics*
  • Sequence Analysis, DNA*