Investigating long range correlation in DNA sequences using significance tests of conditional mutual information

Comput Biol Chem. 2014 Dec:53 Pt A:32-42. doi: 10.1016/j.compbiolchem.2014.08.007. Epub 2014 Aug 20.

Abstract

This study applies Markov chain order estimation to symbol sequences from systems exhibiting long memory or long range correlations (LRC), such as DNA sequences. When the sequence length is limited, an LRC chain can be approximated by a high order Markov chain. The order is estimated with the parametric significance test of the conditional mutual information IC(m), which an earlier work found suitable for high order estimation. Here, the test is computationally optimized by an iterative algorithm that calculates IC(m) at increasing order m, enabling the analysis of long symbol sequences of high Markov chain order or with LRC. The simulation study shows that when the true order is reasonably small, the estimated order saturates at the true order as the symbol sequence length increases, whereas when the true order is very large or the chain has LRC, the estimated order increases logarithmically with the symbol sequence length. The estimated order depends differently on the DNA sequence length for bacteria, the plant Arabidopsis thaliana and the human chromosome, indicating a different long memory structure in their DNA.
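
To make the quantity IC(m) concrete, the following is a minimal sketch of a plug-in (relative frequency) estimate of the conditional mutual information between a symbol and the symbol m steps earlier, given the m-1 symbols in between. It is not the authors' iterative algorithm and it does not reproduce their parametric significance test; the function name `conditional_mutual_information`, the use of natural logarithms, and the choice to take all counts from the same set of overlapping windows are assumptions made for illustration.

```python
from collections import Counter
from math import log


def conditional_mutual_information(seq, m):
    """Plug-in estimate (in nats) of I(x_i; x_{i-m} | x_{i-1}, ..., x_{i-m+1}).

    Hypothetical helper for illustration: `seq` is a string of symbols
    (e.g. "ACGT...") and `m` is the candidate Markov chain order (m >= 1).
    """
    if m < 1 or len(seq) <= m:
        raise ValueError("need m >= 1 and len(seq) > m")

    n = len(seq) - m  # number of overlapping windows of length m + 1
    joint = Counter()  # (x_{i-m}, ..., x_i)
    left = Counter()   # (x_{i-m}, ..., x_{i-1})
    right = Counter()  # (x_{i-m+1}, ..., x_i)
    mid = Counter()    # (x_{i-m+1}, ..., x_{i-1}), empty when m == 1

    for i in range(n):
        w = seq[i:i + m + 1]
        joint[w] += 1
        left[w[:-1]] += 1
        right[w[1:]] += 1
        mid[w[1:-1]] += 1

    # Sum over observed windows of p(w) * log( p(w) p(mid) / (p(left) p(right)) ).
    # The 1/n factors inside the logarithm cancel, so raw counts are used there.
    cmi = 0.0
    for w, c in joint.items():
        ratio = c * mid[w[1:-1]] / (left[w[:-1]] * right[w[1:]])
        cmi += (c / n) * log(ratio)
    return cmi
```

As a sanity check on the sketch, a strictly periodic sequence such as `"ACGT" * 50` is a first order chain: the estimate at m = 1 is close to log 4, and at m = 2 it is zero, because the intermediate symbol already determines the next one. On real data the estimate is positive at every m because of finite sample bias, which is why a significance test is needed before concluding that IC(m) differs from zero.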

Keywords: Conditional mutual information; DNA sequence; Long range correlations; Markov chain order; Significance test.

MeSH terms

  • Algorithms
  • Arabidopsis / genetics*
  • Bacillus subtilis / genetics*
  • Chromosome Mapping / statistics & numerical data
  • Computer Simulation
  • DNA / genetics
  • Genome*
  • Haemophilus influenzae / genetics*
  • Humans
  • Markov Chains
  • Mycoplasma pneumoniae / genetics*
  • Sequence Analysis, DNA / statistics & numerical data*
  • Species Specificity

Substances

  • DNA