iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding

Nguyen Quoc Khanh Le; Edward Kien Yee Yapp; Quang-Thai Ho; N Nagasundaram; Yu-Yen Ou; Hui-Yuan Yeh

doi:10.1016/j.ab.2019.02.017

Anal Biochem. 2019 Apr 15;571:53-61. doi: 10.1016/j.ab.2019.02.017. Epub 2019 Feb 26.

Authors

Nguyen Quoc Khanh Le¹, Edward Kien Yee Yapp², Quang-Thai Ho³, N Nagasundaram⁴, Yu-Yen Ou³, Hui-Yuan Yeh⁵

Affiliations

¹ Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore. Electronic address: khanhle@ntu.edu.sg.
² Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634, Singapore.
³ Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan.
⁴ Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore.
⁵ Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore. Electronic address: hyyeh@ntu.edu.sg.

PMID: 30822398
DOI: 10.1016/j.ab.2019.02.017

Abstract

An enhancer is a short (50-1500 bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer, disorder or inflammatory bowel disease. Because of the importance of enhancers in genomics, their classification has become a popular research area in computational biology. Although a few computational tools have been developed to address this problem, their performance still requires improvement. In this study, we represent enhancers using word embeddings, including sub-word information of their biological words, which then serve as features for a support vector machine classifier. We present iEnhancer-5Step, a web server containing two-layer classifiers that identify enhancers and their strength. It attains independent test accuracies of 79% and 63.5% in the first and second layers, respectively. On the same dataset, the proposed method outperforms current predictors. Moreover, this study provides a basis for further research applying natural language processing techniques to biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.

Keywords: Continuous bag of words; Regulatory transcription factor; Sequence analysis; Skip gram; Support vector machine; Two-layer classification.
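
The abstract describes the pipeline only in outline, so the following is a minimal sketch rather than the authors' implementation. It assumes that "biological words" are overlapping k-mers and that "sub-word information" means character n-grams in the fastText style; the values of k and the n-gram range, the function names, and the "strong"/"weak" labels for the second layer are all illustrative assumptions. The two classifiers are passed in as plain callables, standing in for the trained support vector machines.

```python
from typing import Callable, List


def kmer_words(seq: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (assumed "biological words")."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def subwords(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of a word, with '<' and '>' marking its boundaries.

    This mirrors how sub-word embedding models decompose a word; the
    n-gram range used here is an assumption, not taken from the paper.
    """
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def two_layer_predict(
    seq: str,
    is_enhancer: Callable[[str], bool],  # hypothetical layer-1 classifier
    is_strong: Callable[[str], bool],    # hypothetical layer-2 classifier
) -> str:
    """Layer 1 decides enhancer vs. non-enhancer; layer 2 grades strength."""
    if not is_enhancer(seq):
        return "non-enhancer"
    return "strong enhancer" if is_strong(seq) else "weak enhancer"
```

In a full pipeline, the sub-word vectors of each k-mer would be combined into one feature vector per sequence before classification; that step depends on a trained embedding model and is omitted here.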

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology*
DNA / genetics*
Enhancer Elements, Genetic / genetics*
Humans
Sequence Analysis, DNA
Support Vector Machine*

Substances

DNA