Linear regression models predicting strength of transcriptional activity of promoters

Genome Inform. 2011;25(1):53-60.

Abstract

We developed linear regression models that predict the strength of promoter transcriptional activity from promoter sequence. The models were built from intrinsic transcriptional strength data for 451 human promoter sequences, measured by systematic luciferase reporter gene assays in three cell lines (HEK293, MCF7 and 3T3). Each model sums the contributions of CG dinucleotide content and transcription factor binding sites (TFBSs) to transcriptional strength. We evaluated the prediction accuracy of the models by cross-validation and found that, despite their simple formulation, they predict the transcriptional strength of promoters adequately. We also evaluated the statistical significance of the individual contributions and propose a picture of the regulatory code hidden in promoter sequences: CG dinucleotide content mainly determines the strength of transcriptional activity under ubiquitous environments, whereas TFBSs mainly determine it under specific environments.
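
As an illustration of the kind of additive model described above, the following is a minimal sketch, not the authors' implementation. It assumes that each promoter is encoded as an intercept, its CG dinucleotide fraction, and one 0/1 indicator per TFBS motif, and that a motif is detected by exact substring match (a simplification; the abstract does not say how TFBSs were scored). The function names, the motif list, and the use of Pearson correlation as the cross-validation score are all hypothetical choices made for this example.

```python
import numpy as np

def cg_content(seq):
    """Fraction of adjacent dinucleotide positions that read 'CG'."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return hits / (len(seq) - 1)

def build_features(seqs, motifs):
    """One row per promoter: intercept, CG content, 0/1 flag per motif.

    Exact substring matching stands in for real TFBS detection
    (an assumption of this sketch).
    """
    rows = []
    for s in seqs:
        s = s.upper()
        flags = [1.0 if m in s else 0.0 for m in motifs]
        rows.append([1.0, cg_content(s)] + flags)
    return np.array(rows)

def fit(X, y):
    """Ordinary least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def cross_validate(X, y, k=5, seed=0):
    """k-fold cross-validation; returns the Pearson correlation between
    held-out predictions and observed strengths."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        pred[fold] = X[fold] @ fit(X[train], y[train])
    return np.corrcoef(pred, y)[0, 1]
```

In this encoding, each fitted coefficient is the estimated contribution of one feature to transcriptional strength, so fitting one model per cell line would allow the CG and TFBS terms to be compared across cell lines.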

MeSH terms

  • 3T3 Cells
  • Animals
  • Base Composition
  • Binding Sites
  • HEK293 Cells
  • Humans
  • Linear Models
  • MCF-7 Cells
  • Mice
  • Models, Genetic*
  • Promoter Regions, Genetic*
  • Transcription Factors / metabolism
  • Transcription, Genetic*

Substances

  • Transcription Factors