Identification of Words in Biological Sequences Under the Semi-Markov Hypothesis

J Comput Biol. 2020 May;27(5):683-697. doi: 10.1089/cmb.2019.0253. Epub 2019 Sep 23.

Abstract

Identifying a word (pattern) in a long sequence of letters is not an easy task. To achieve this objective, several models have been proposed under the assumption that the sequence of letters is described by a Markov chain. The Markovian hypothesis imposes restrictions on the distribution of the sojourn time in a state, which has geometric distribution in a discrete process. This is the main drawback when applying Markov chains to real problems. By contrast, semi-Markov processes are generalized. In semi-Markov processes, the sojourn time in a state can be governed by any distribution function. The goal of this article is to compute the first hitting time (position) of a word (pattern) in a semi-Markov sequence. To achieve this objective, we use the auxiliary prefix and backward chain. To give an example of the applications of the proposed model, the model is tested in a bacteriophage DNA sequence that is lacking the enzyme SmaI. We compute the probability that a word occurs for the first time after n nucleotides in a DNA sequence. The corresponding probability distribution, the mean waiting position, the variance, and rate of the occurrence of the word are obtained.

Keywords: DNA analysis; backward position process; first occurrence of a word; prefix process; semi-Markov chain.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • DNA / genetics*
  • Humans
  • Markov Chains*
  • Models, Statistical
  • Probability
  • Sequence Analysis, DNA / methods*

Substances

  • DNA