BigMPI4py: Python Module for Parallelization of Big Data Objects Discloses Germ Layer Specific DNA Demethylation Motifs

IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1507-1522. doi: 10.1109/TCBB.2020.3043979. Epub 2022 Jun 3.

Abstract

Parallelization in Python integrates the Message Passing Interface via the mpi4py module. Since mpi4py does not support parallelization of objects larger than 2^31 bytes, we developed BigMPI4py, a Python module that wraps mpi4py and supports object sizes beyond this boundary. BigMPI4py automatically determines the optimal object distribution strategy and uses vectorized methods, achieving higher parallelization efficiency. BigMPI4py facilitates the use of Python for Big Data applications on multicore workstations and High Performance Computing systems. We use BigMPI4py to speed up the search for germ layer specific de novo DNA methylated/unmethylated motifs in the 59 whole genome bisulfite sequencing (WGBS) DNA methylation samples from 27 human tissues of the ENCODE project. We developed a parallel implementation of the Kruskal-Wallis test to find CpGs with differential methylation across germ layers. The parallel evaluation of the significance of 55 million CpGs achieved a 22x speedup with 25 cores. This allowed the efficient identification of one set of hypermethylated genes in ectoderm- and mesoderm-related tissues and another set in endoderm-related tissues and, finally, the discovery of germ layer specific DNA demethylation motifs. Our results indicate that the DNA methylation signal provides a higher degree of information for the demethylated state than for the methylated state. BigMPI4py is available at https://www.arauzolab.org/tools/bigmpi4py and https://gitlab.com/alexmascension/bigmpi4py, and the Jupyter Notebook with the WGBS analysis is available at https://gitlab.com/alexmascension/wgbs-analysis.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Big Data*
  • DNA / metabolism
  • DNA Demethylation*
  • DNA Methylation / genetics
  • Germ Layers / metabolism
  • Humans
  • Sequence Analysis, DNA / methods

Substances

  • DNA