Nucleotide patterns aiding in prediction of eukaryotic promoters

PLoS One. 2017 Nov 15;12(11):e0187243. doi: 10.1371/journal.pone.0187243. eCollection 2017.

Abstract

Computational analysis of promoters is hindered by the complexity of their architecture. In less-studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites (TSS) and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into "promoters" and "non-promoters" even in the absence of full-length cDNA sequences. These models may be built upon maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, and transcription factor binding sites, as well as the relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors (TFs): those that bind preferentially to the [-500,0] region (188 "promoter-specific" TFs), those that bind preferentially to the [0,500] region (282 "5' UTR-specific" TFs), and those with little or no location preference with respect to the TSS (207 "promiscuous" TFs). For the most informative motifs, positional preferences are conserved between dicots and monocots.
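The three-way grouping of transcription factors by positional preference can be illustrated with a minimal sketch. This is not the clustering procedure used in the paper: the function name `classify_tf`, the `ratio` threshold, the toy site positions, and the choice to count position 0 as downstream are all assumptions made for illustration only.

```python
def classify_tf(positions, ratio=2.0):
    """Label a TF by where its predicted binding sites fall relative to the TSS.

    positions: site coordinates relative to the TSS (TSS = 0).
    ratio: hypothetical enrichment threshold, not a value from the paper.
    """
    # Count sites in the upstream [-500, 0) and downstream [0, 500] windows.
    # Assigning position 0 to the downstream window is an arbitrary choice.
    upstream = sum(1 for p in positions if -500 <= p < 0)
    downstream = sum(1 for p in positions if 0 <= p <= 500)

    if upstream + downstream == 0:
        return "no sites in window"
    # max(..., 1) avoids calling a TF specific on the basis of a single site.
    if upstream >= ratio * max(downstream, 1):
        return "promoter-specific"
    if downstream >= ratio * max(upstream, 1):
        return "5' UTR-specific"
    return "promiscuous"


# Toy, made-up site positions for three hypothetical TFs.
sites = {
    "TF_A": [-450, -300, -120, -80, -10, 40],
    "TF_B": [-200, 15, 60, 130, 310, 480],
    "TF_C": [-400, -150, -30, 20, 220, 410],
}
labels = {tf: classify_tf(pos) for tf, pos in sites.items()}
```

With these toy inputs, `TF_A` has most of its sites upstream of the TSS, `TF_B` has most downstream, and `TF_C` is evenly split. A real analysis would replace the fixed ratio with a statistical test on site counts.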

MeSH terms

  • Algorithms
  • Binding Sites
  • DNA Methylation
  • Eukaryota / genetics*
  • Evolution, Molecular
  • Nucleotides / metabolism*
  • Oryza / genetics
  • Promoter Regions, Genetic*
  • Transcription Factors / metabolism

Substances

  • Nucleotides
  • Transcription Factors

Grants and funding

AK was supported by a grant of the Federal Targeted Program “Research and development on priority directions of science and technology in Russia, 2014–2020”, Contract № 14.604.21.0101, unique identifier of the applied scientific project: RFMEFI60414X0101. AK's work was also supported by the following grants of the EU FP7 program: “SYSCOL”, “SysMedIBD”, “RESOLVE” and “MIMOMICS”. TT and MT were supported by the NSF Division of Environmental Biology (1456634). TT, MT and AB were supported by NSF STTR award 1622840. Additional funding was provided by GeneXplain GmbH in the form of salary for AK, and by Softberry, Inc. in the form of salary for VS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.