Using Machine Learning Approaches to Predict Target Gene Expression in Rice T-DNA Insertional Mutants

Ching-Hsuan Chien; Lan-Ying Huang; Shuen-Fang Lo; Liang-Jwu Chen; Chi-Chou Liao; Jia-Jyun Chen; Yen-Wei Chu

doi:10.3389/fgene.2021.798107

Front Genet. 2021 Dec 17:12:798107. doi: 10.3389/fgene.2021.798107. eCollection 2021.

Authors

Ching-Hsuan Chien¹, Lan-Ying Huang¹, Shuen-Fang Lo², Liang-Jwu Chen³,⁴, Chi-Chou Liao³, Jia-Jyun Chen⁵, Yen-Wei Chu¹,²,³,⁵,⁶,⁷,⁸

Affiliations

¹ Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan.
² Biotechnology Center, National Chung Hsing University, Taichung, Taiwan.
³ Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan.
⁴ Advanced Plant Biotechnology Center, National Chung Hsing University, Taichung, Taiwan.
⁵ Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan.
⁶ Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan.
⁷ Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan.
⁸ Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, Taichung, Taiwan.

Abstract

Inserting T-DNA into the genome to alter the expression of flanking genes is a common approach in rice functional gene research. However, whether the expression of a gene of interest is actually enhanced must be validated experimentally. To improve the efficiency of screening for activated genes, we established a machine learning model that predicts gene expression in T-DNA mutants. We gathered experimental datasets of gene expression in T-DNA mutants and extracted the PROMOTER and MIDDLE sequences for encoding. In the first layer, support vector machine (SVM) models were constructed with nine features covering biological function and local and global sequence information. Feature encodings based on the PROMOTER sequence were weighted by logistic regression. The second layer integrated 16 first-layer models using minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were chosen from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer model, referred to as TIMgo, was 99.3% in fivefold cross-validation and 85.6% in independent testing. We found that local sequence information contributed more to classification than global sequence information. TIMgo showed good predictive ability for target genes within 20 kb of the 35S enhancer. Based on an analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
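The two-layer design described in the abstract can be illustrated with a minimal sketch. It is not the TIMgo implementation: a nearest-centroid classifier stands in for both the SVMs and LADTree, mRMR feature selection and the logistic-regression weighting are omitted, and the feature-group names, the function names, and the toy values are all hypothetical. It shows only the data flow, in which each first-layer model scores every sample out of fold and the second layer is trained on those scores.

```python
from statistics import mean

def fit_centroids(X, y):
    """Per-class mean vectors. Stand-in classifier (assumption), not an SVM."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def score(centroids, x):
    """Positive when x is closer to the class-1 centroid than to class 0."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    return d0 - d1

def out_of_fold_scores(X, y, k=5):
    """First layer: score each sample with a model that never saw it."""
    n = len(X)
    scores = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, n, k):  # held-out samples of this fold
            scores[i] = score(model, X[i])
    return scores

def two_layer_predict(feature_groups, y, k=5):
    """Second layer: one classifier over the first-layer score columns."""
    columns = [out_of_fold_scores(X, y, k) for X in feature_groups.values()]
    meta = [list(row) for row in zip(*columns)]
    second_layer = fit_centroids(meta, y)
    return [1 if score(second_layer, row) > 0 else 0 for row in meta]

# Hypothetical toy data: 1 = activated gene, 0 = not activated.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
feature_groups = {
    "local": [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [0.0, 0.2],
              [1.0, 0.9], [0.9, 1.0], [0.8, 0.9], [0.9, 0.8], [1.0, 0.8]],
    "global": [[0.4], [0.5], [0.3], [0.6], [0.2],
               [0.5], [0.7], [0.6], [0.8], [0.4]],
}
predictions = two_layer_predict(feature_groups, labels)
print(predictions)
```

In this sketch the second layer is evaluated on the same samples it was trained on, so its output says nothing about generalization. An honest accuracy estimate would need a held-out set, as in the independent testing reported above.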

Keywords: CaMV 35S enhancer; T-DNA activation tagging; gene expression; machine learning; rice.