A sequence-based prediction of Kruppel-like factors proteins using XGBoost and optimized features

Gene. 2021 Jun 30:787:145643. doi: 10.1016/j.gene.2021.145643. Epub 2021 Apr 18.

Abstract

Krüppel-like factors (KLFs) are a group of conserved zinc finger-containing transcription factors involved in various physiological and biological processes, including cell proliferation, differentiation, development, and apoptosis. Bioinformatics methods such as sequence similarity searches, multiple sequence alignment, phylogenetic reconstruction, and gene synteny analysis have been proposed to broaden our knowledge of KLF proteins. In this study, we propose a novel computational approach that applies machine learning to features calculated from primary sequences. Specifically, our XGBoost-based model identifies KLF proteins with an accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.704. The model also performs promisingly on an independent dataset. Therefore, it could serve as a useful tool for identifying new KLF proteins and provide necessary information for biologists and researchers working on KLF proteins. Our machine learning source code and datasets are freely available at https://github.com/khanhlee/KLF-XGB.
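The abstract reports both accuracy and MCC. On an imbalanced dataset (the keywords mention SMOTE) the two can diverge considerably. The following minimal sketch shows how each is computed from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function names are our own.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced test set (NOT from the paper):
# 40 positive (KLF) and 960 negative (non-KLF) sequences.
tp, tn, fp, fn = 30, 950, 10, 10

print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

With a large negative class, accuracy is dominated by the true negatives, whereas MCC uses all four cells of the confusion matrix. This is one reason the two metrics are often reported together for imbalanced problems.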

Keywords: Feature selection; Kruppel-like factor; Protein sequence; SMOTE imbalance; Zinc finger; eXtreme Gradient Boosting.

Publication types

  • Comparative Study
  • Evaluation Study

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Animals
  • Computational Biology* / methods
  • Databases, Protein
  • Humans
  • Kruppel-Like Transcription Factors / analysis
  • Kruppel-Like Transcription Factors / chemistry*
  • Kruppel-Like Transcription Factors / genetics
  • Machine Learning
  • Models, Biological

Substances

  • Kruppel-Like Transcription Factors