Mb-level CpG and TFBS islands visualized by AI and their roles in the nuclear organization of the human genome

Genes Genet Syst. 2020 Apr 22;95(1):29-41. doi: 10.1266/ggs.19-00027. Epub 2020 Mar 12.

Abstract

Unsupervised machine learning that can discover novel knowledge from big sequence data without prior knowledge or particular models is highly desirable for current genome study. We previously established a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions, which can reveal various novel genome characteristics from big sequence data, and found that transcription factor binding sequences (TFBSs) and CpG-containing oligonucleotides are enriched in human centromeric and pericentromeric regions, which support centromere clustering and form the condensed heterochromatin "chromocenter" in interphase nuclei. The number and size of chromocenters, as well as the type of centromeres gathered in individual chromocenters, vary depending on cell type. To study molecular mechanisms of cell type-dependent chromocenter formation, we analyzed distribution patterns of occurrence per Mb of hexa- and heptanucleotide TFBSs, which have been compiled by the SwissRegulon Portal, and of CpG-containing oligonucleotides. We found Mb-level islands enriched for TFBSs and CpG-containing oligonucleotides in centromeric and pericentromeric regions on all human chromosomes except chrY. Considering molecular mechanisms for cell type-dependent centromere clustering, the chromosome-dependent enrichment of a set of TFBSs and CpG-containing oligonucleotides is of particular interest, since the cellular content of TFs and methyl-CpG-binding proteins exhibits cell type-dependent regulation. A newly introduced BLSOM, which analyzed occurrences of a total of 3,946 octanucleotide TFBSs compiled by the SwissRegulon Portal, has self-organized (separated) the sequences that are characteristically enriched in TFBSs and shown that these sequences are derived primarily from centromeric and pericentromeric constitutive heterochromatin regions. Furthermore, the BLSOM identified and visualized characteristic TFBSs that are enriched in these regions. 
By analyzing Hi-C data for interchromosomal interactions, the present study showed that the chromatin segments supporting these interactions are located primarily in Mb-level TFBS and CpG islands and are thus enriched for a wide variety of TFBSs and CpG-containing oligonucleotides.
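
The per-Mb occurrence analysis described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `count_motifs_per_window`, the toy sequence, and the example motifs are all hypothetical, and the sketch assumes that only the given strand is scanned, that overlapping occurrences are counted, and that a motif is assigned to the window containing its start position.

```python
from collections import Counter


def count_motifs_per_window(sequence, motifs, window_size=1_000_000):
    """Count motif occurrences in non-overlapping windows of a sequence.

    Returns a list with one Counter per window. Assumptions (not from the
    source): single strand only, overlapping matches are counted, and each
    match is assigned to the window that contains its start position.
    """
    sequence = sequence.upper()
    motifs = {m.upper() for m in motifs}
    lengths = sorted({len(m) for m in motifs})

    # Ceiling division; an empty sequence still yields one (empty) window.
    n_windows = max(1, -(-len(sequence) // window_size))
    counts = [Counter() for _ in range(n_windows)]

    for start in range(len(sequence)):
        window = start // window_size
        for k in lengths:
            word = sequence[start:start + k]
            if len(word) == k and word in motifs:
                counts[window][word] += 1
    return counts


if __name__ == "__main__":
    # Toy input with a small window so the result is easy to inspect;
    # a real analysis would use chromosome sequences and 1-Mb windows.
    toy_sequence = "ACGCGTTTTTACGCGT"
    toy_motifs = ["CGCG", "ACGCGT"]  # hypothetical, not SwissRegulon entries
    for index, counter in enumerate(
        count_motifs_per_window(toy_sequence, toy_motifs, window_size=8)
    ):
        print(index, dict(counter))
```

Windows whose counts are high for many motifs at once would correspond to the kind of Mb-level enrichment the abstract calls islands; the threshold for "enriched" is left open here because the abstract does not state one.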

Keywords: Hi-C; Self-Organizing Map; big data; oligonucleotide composition; unsupervised machine learning.

MeSH terms

  • Artificial Intelligence*
  • Binding Sites
  • Centromere / genetics
  • Chromosomes, Human / genetics*
  • CpG Islands / genetics*
  • Genome, Human / genetics*
  • Heterochromatin / genetics
  • Humans
  • Oligonucleotides / genetics
  • Protein Binding
  • Transcription Factors / genetics
  • Transcription Factors / metabolism

Substances

  • Heterochromatin
  • Oligonucleotides
  • Transcription Factors