Inferred regulons are consistent with regulator binding sequences in E. coli

PLoS Comput Biol. 2024 Jan 22;20(1):e1011824. doi: 10.1371/journal.pcbi.1011824. eCollection 2024 Jan.

Abstract

The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Regulons are typically determined either from resource-intensive experimental measurement of functional binding sites, or inferred from analysis of high-throughput gene expression datasets. Recently, independent component analysis (ICA) of RNA-seq compendia has been shown to be a powerful method for inferring bacterial regulons. However, it remains unclear to what extent regulons predicted by ICA structure have a biochemical basis in promoter sequences. Here, we address this question by developing machine learning models that predict inferred regulon structures in E. coli based on promoter sequence features. Models were constructed successfully (cross-validation AUROC ≥ 0.8) for 85% (40/47) of ICA-inferred E. coli regulons. We found that: 1) the presence of a high-scoring regulator motif in the promoter region was sufficient to specify regulatory activity in 40% (19/47) of the regulons; 2) additional features, such as DNA shape and extended motifs that can account for regulator multimeric binding, helped to specify regulon structure for the remaining 60% of regulons (28/47); and 3) investigating regulons where initial machine learning models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, we found that strong regulatory binding sequences underlie both the genes shared between ICA-inferred and experimental regulons and the genes in the E. coli core pan-regulon of Fur. This work demonstrates that the structure of ICA-inferred regulons can largely be understood through the strength of regulator binding sites in promoter regions, reinforcing the utility of top-down inference for regulon discovery.
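The idea of a "high-scoring regulator motif" predicting regulon membership, evaluated by AUROC, can be illustrated with a minimal sketch. This is not the authors' pipeline: the count matrix `TOY_COUNTS`, the toy promoters, the labels, and the helper names `log_odds_matrix`, `best_motif_score`, and `auroc` are all hypothetical, and the sketch uses a single motif score as the only feature, with no cross-validation, DNA shape, or extended motifs.

```python
import math

# Hypothetical 4-bp count matrix (toy data, not a real E. coli regulator motif).
TOY_COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
    {"A": 0, "C": 0, "G": 9, "T": 1},
    {"A": 9, "C": 0, "G": 1, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def best_motif_score(sequence, pwm):
    """Return the highest log-odds score over all windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


pwm = log_odds_matrix(TOY_COUNTS)

# Toy promoters: label 1 = gene assigned to the (hypothetical) inferred regulon.
toy_promoters = ["CCATGACC", "GGGATGAT", "CCCCGGGG", "TTTTCCCC"]
toy_labels = [1, 1, 0, 0]

toy_scores = [best_motif_score(p, pwm) for p in toy_promoters]
toy_auroc = auroc(toy_scores, toy_labels)
```

In this toy setting, a regulon would count as "explained by motif strength alone" if the AUROC of the motif score cleared a chosen threshold such as the 0.8 cutoff quoted above; a real analysis would estimate it on held-out folds.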

MeSH terms

  • Bacteria / genetics
  • Bacterial Proteins / metabolism
  • Binding Sites / genetics
  • Escherichia coli* / genetics
  • Escherichia coli* / metabolism
  • Gene Expression Regulation, Bacterial / genetics
  • Promoter Regions, Genetic / genetics
  • Regulon* / genetics

Substances

  • Bacterial Proteins

Grants and funding

This work was funded by the Novo Nordisk Foundation Grant Numbers NNF10CC1016517 and NNF20CC0035580 (BOP and DCZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.