Analysis of strand-specific RNA-seq data using machine learning reveals the structures of transcription units in Clostridium thermocellum

Wen-Chi Chou; Qin Ma; Shihui Yang; Sha Cao; Dawn M Klingeman; Steven D Brown; Ying Xu

doi:10.1093/nar/gkv177

Nucleic Acids Res. 2015 May 26;43(10):e67. doi: 10.1093/nar/gkv177. Epub 2015 Mar 12.

Authors

Wen-Chi Chou¹, Qin Ma¹, Shihui Yang², Sha Cao³, Dawn M Klingeman⁴, Steven D Brown⁴, Ying Xu⁵

Affiliations

¹ Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, GA 30602, USA; BioEnergy Science Center, TN 37831, USA.
² BioEnergy Science Center, TN 37831, USA; Biosciences Division, Oak Ridge National Laboratory, TN 37831, USA; National Bioenergy Center, National Renewable Energy Laboratory, Golden, CO 80401, USA.
³ Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, GA 30602, USA.
⁴ BioEnergy Science Center, TN 37831, USA; Biosciences Division, Oak Ridge National Laboratory, TN 37831, USA.
⁵ Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, GA 30602, USA; BioEnergy Science Center, TN 37831, USA; College of Computer Science and Technology and School of Public Health, Jilin University, Changchun, Jilin 130012, China. Email: xyn@bmb.uga.edu.

Abstract

Identifying the transcription units (TUs) encoded in a bacterial genome is essential for elucidating the organism's transcriptional regulation. To gain a detailed understanding of the dynamically composed TU structures, we used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters that measure RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs were predicted from the four RNA-seq datasets, and 44% of them contain multiple genes. We assessed the prediction method on an independent set of RNA-seq data with longer reads, and the evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the method, we also applied it to RNA-seq data collected from Escherichia coli and achieved high prediction accuracies. The TU prediction program, named SeqTU, is publicly available at https://code.google.com/p/seqtu/. We expect the predicted TUs to serve as baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.
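
The two kinds of parameters named in the abstract, expression-level continuity and variance, can be illustrated with a minimal Python sketch. It is not the SeqTU implementation: the function name `gap_features`, the half-open coordinate convention, and the specific definitions used here (fraction of covered bases in the intergenic gap for continuity, coefficient of variation over the two-gene span for variance) are assumptions chosen only to show the kind of signal such features could capture for a pair of adjacent genes on the same strand.

```python
from statistics import mean, pstdev


def gap_features(coverage, gene_a, gene_b):
    """Return (continuity, variation) for two adjacent same-strand genes.

    coverage : list of per-base read depths for one strand (0-based index).
    gene_a, gene_b : (start, end) half-open intervals, gene_a upstream of gene_b.

    Both feature definitions are illustrative assumptions, not the
    definitions used by SeqTU.
    """
    a_start, a_end = gene_a
    b_start, b_end = gene_b
    if not (0 <= a_start < a_end <= b_start < b_end <= len(coverage)):
        raise ValueError("genes must be ordered, non-overlapping, and in range")

    gap = coverage[a_end:b_start]    # intergenic region between the two genes
    span = coverage[a_start:b_end]   # gene_a + gap + gene_b

    # Continuity (assumed definition): fraction of gap bases with any reads.
    # Genes that abut with no gap are treated as fully continuous.
    continuity = sum(1 for depth in gap if depth > 0) / len(gap) if gap else 1.0

    # Variation (assumed definition): coefficient of variation of the depth
    # over the whole span; low values suggest one evenly expressed transcript.
    avg = mean(span)
    variation = pstdev(span) / avg if avg > 0 else float("inf")

    return continuity, variation


# Toy coverage tracks: one with reads across the gap, one with a dropout.
joined = [10] * 5 + [9] * 3 + [10] * 5
split = [10] * 5 + [0] * 3 + [10] * 5

print(gap_features(joined, (0, 5), (8, 13)))
print(gap_features(split, (0, 5), (8, 13)))
```

In this toy setting, a gene pair with uninterrupted, even coverage yields high continuity and low variation, which is the pattern one would expect when both genes belong to the same TU. A classifier trained on such features would replace any fixed cutoff, which is why no threshold appears in the sketch.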

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Validation Study

MeSH terms

Artificial Intelligence*
Clostridium thermocellum / genetics*
Escherichia coli / genetics
Genome, Bacterial
Sequence Analysis, RNA / methods*
Transcription, Genetic*