iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information

Hao Wu; Mengdi Liu; Pengyu Zhang; Hongming Zhang

doi:10.1093/bfgp/elac057

Brief Funct Genomics. 2023 May 18;22(3):302-311. doi: 10.1093/bfgp/elac057.

Authors

Hao Wu¹,², Mengdi Liu¹, Pengyu Zhang¹, Hongming Zhang¹

Affiliations

¹ College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China.
² School of Software, Shandong University, Jinan, 250101, Shandong, China.

PMID: 36715222
DOI: 10.1093/bfgp/elac057

Abstract

Enhancers, a class of distal cis-regulatory elements located in the non-coding regions of DNA, play a key role in gene regulation. Identifying enhancers from DNA sequence data is difficult because they are scattered throughout the non-coding regions, have no specific sequence features, and can lie far from their target promoters. This study therefore presents a stacking ensemble learning method that identifies enhancers and classifies them as strong or weak. First, we obtain a fused feature matrix by combining four feature encodings: Kmer, PseDNC, PCPseDNC and Z-Curve9. Second, five K-Nearest Neighbor (KNN) models with different parameters are trained as the base models, and the Logistic Regression algorithm is used as the meta-model. Third, the stacking ensemble learning strategy combines the base models and the meta-model into a two-layer stacked model, which is trained on the preprocessed feature sets. The proposed method, named iEnhancer-SKNN, is a two-layer prediction model: the first layer predicts whether a given DNA sequence is an enhancer or a non-enhancer, and the second layer predicts whether a predicted enhancer is strong or weak. We evaluate iEnhancer-SKNN on the independent testing dataset, and the results show that it outperforms other predictors in predicting enhancers and their strength. In enhancer identification, iEnhancer-SKNN achieves an accuracy of 81.75%, an improvement of 1.35% to 8.75% over other predictors; in enhancer classification, it achieves an accuracy of 80.50%, an improvement of 5.5% to 25.5% over other predictors. Moreover, we identify key transcription factor binding site motifs in the enhancer regions and further explore the biological functions of the enhancers and these key motifs.
Source code and data can be downloaded from https://github.com/HaoWuLab-Bioinformatics/iEnhancer-SKNN.
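
The stacking idea described in the abstract (several KNN base models whose outputs feed a logistic regression meta-model) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. It assumes only the Kmer encoding with k = 2 rather than the four fused features, neighbor counts of 1 to 5 chosen arbitrarily, leave-one-out scores as the meta-features, and invented toy sequences in which GC-only and AT-only strings stand in for the two classes. The names `kmer_features`, `knn_score`, `fit_logistic` and `StackedKNN` are hypothetical.

```python
import math
from itertools import product


def kmer_features(seq, k=2):
    """Normalized k-mer frequencies over A/C/G/T, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def knn_score(train_X, train_y, x, k, skip=None):
    """Fraction of positive labels among the k nearest training points.

    `skip` leaves one training index out, so a point never votes for itself.
    """
    neighbours = sorted(
        (math.dist(x, xi), yi)
        for i, (xi, yi) in enumerate(zip(train_X, train_y))
        if i != skip
    )[:k]
    return sum(label for _, label in neighbours) / len(neighbours)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Batch gradient descent for logistic regression (the meta-model)."""
    n, d = len(Z), len(Z[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sigmoid(b + sum(wj * zj for wj, zj in zip(w, z))) - t
                for z, t in zip(Z, y)]
        w = [wj - lr * sum(e * z[j] for e, z in zip(errs, Z)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b


class StackedKNN:
    """Several KNN base models (assumed k values) stacked under logistic regression."""

    def __init__(self, ks=(1, 2, 3, 4, 5)):
        self.ks = ks

    def fit(self, X, y):
        self.X, self.y = X, y
        # Leave-one-out base-model scores serve as out-of-sample meta-features.
        Z = [[knn_score(X, y, x, k, skip=i) for k in self.ks]
             for i, x in enumerate(X)]
        self.w, self.b = fit_logistic(Z, y)
        return self

    def predict_proba(self, x):
        z = [knn_score(self.X, self.y, x, k) for k in self.ks]
        return sigmoid(self.b + sum(wj * zj for wj, zj in zip(self.w, z)))


# Toy data invented for illustration only: label 1 = GC-only, label 0 = AT-only.
positives = ["GCGCCGGCGC", "CCGGCGCGGC", "GGCCGCGCCG",
             "CGCGGCCGCG", "GCCGCGGCCG", "CGGCGCCGGC"]
negatives = ["ATATTAATAT", "TTAATATAAT", "AATTATATTA",
             "TATAATTATA", "ATTATAATTA", "TAATATTAAT"]
X = [kmer_features(s) for s in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

model = StackedKNN().fit(X, y)
print(model.predict_proba(kmer_features("GGCGCCGCGG")))  # GC-only query
print(model.predict_proba(kmer_features("TTATAATATT")))  # AT-only query
```

In a two-layer predictor of the kind the abstract describes, one such stacked model would separate enhancers from non-enhancers and a second one, trained on strong versus weak enhancers, would be applied only to sequences the first accepts. For real data, a library implementation such as scikit-learn's `StackingClassifier` with `KNeighborsClassifier` base estimators would likely be a more practical starting point than this hand-written version.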

Keywords: enhancer identification; sequence analysis; stacking ensemble learning; transcription factor motifs.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

DNA
Enhancer Elements, Genetic* / genetics
Machine Learning
Promoter Regions, Genetic / genetics
Sequence Analysis, DNA / methods
Software*

Substances

DNA