dPromoter-XGBoost: Detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost

Methods. 2022 Aug:204:215-222. doi: 10.1016/j.ymeth.2022.01.001. Epub 2022 Jan 5.

Abstract

Promoters, which are responsible for stimulating the transcription and expression of specific genes, play an irreplaceable role in biological processes and genetics. Promoter abnormalities have been found in some diseases, and the level of promoter-binding transcription factors can be used as a marker before a disease occurs. Hence, detecting promoters from DNA sequences has important biological significance; in particular, distinguishing strong promoters can help to elucidate differences in gene expression and the mechanisms of specific diseases. With the introduction of third-generation sequencing, experimental labeling of promoters cannot keep pace with the speed of sequencing. Many computational models have been designed to fill this gap and identify unlabeled DNA. However, their feature representations are limited to a single type of descriptor, which cannot fully reflect the information contained in the original samples. To avoid this information loss, we propose a computational model that represents samples jointly through multiple descriptors and feature selection. Notably, we define a new feature descriptor called the K-mer word vector. The promoter identification model, built on multiple feature descriptors dominated by the K-mer word vector, achieves performance similar to existing methods, and its sensitivity of 85.72% distinguishes promoters more effectively than other methods. Furthermore, the promoter strength model surpasses published methods, and its accuracy of 77.00% greatly improves the ability to distinguish between strong and weak promoters.

Keywords: Binary; K-mer word vector; Promoters; PseDNC; PseKNC; XGBoost.
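
The idea of representing one sequence with several descriptors at once can be illustrated with a minimal sketch. It assumes two generic descriptors only: a normalized k-mer frequency vector and a per-base binary (one-hot) encoding. The function names, the choice of k = 2, and the handling of ambiguous bases are illustrative assumptions, not the authors' implementation; the K-mer word vector, PseDNC and PseKNC descriptors, the feature selection step, and the XGBoost classifier are not reproduced here.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies, ordered lexicographically over ACGT."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def binary_encoding(seq):
    """One-hot encode each base as 4 bits; unknown bases become all zeros."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == a else 0 for a in ALPHABET)
    return vec


def combined_features(seq, k=2):
    """Concatenate both descriptors into a single feature vector (illustrative)."""
    return kmer_frequencies(seq, k) + binary_encoding(seq)


if __name__ == "__main__":
    features = combined_features("ACGTTGCA")
    print(len(features))  # 4**2 k-mer entries plus 4 bits per base
```

In a pipeline of the kind the abstract describes, vectors like these would be computed for every fixed-length sequence, reduced by feature selection, and then passed to a classifier.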

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Sequence
  • Promoter Regions, Genetic*