Identification of Words in Biological Sequences Under the Semi-Markov Hypothesis

Brenda Ivette Garcia-Maya; Nikolaos Limnios

doi:10.1089/cmb.2019.0253

Identification of Words in Biological Sequences Under the Semi-Markov Hypothesis

J Comput Biol. 2020 May;27(5):683-697. doi: 10.1089/cmb.2019.0253. Epub 2019 Sep 23.

Authors

Brenda Ivette Garcia-Maya¹, Nikolaos Limnios¹

Affiliation

¹ Laboratory of Applied Mathematics (LMAC), Université de Technologie de Compiègne, Sorbonne University, Compiègne, France.

PMID: 31545073
DOI: 10.1089/cmb.2019.0253

Abstract

Identifying a word (pattern) in a long sequence of letters is not an easy task. To achieve this objective, several models have been proposed under the assumption that the sequence of letters is described by a Markov chain. The Markovian hypothesis imposes restrictions on the distribution of the sojourn time in a state, which has geometric distribution in a discrete process. This is the main drawback when applying Markov chains to real problems. By contrast, semi-Markov processes are generalized. In semi-Markov processes, the sojourn time in a state can be governed by any distribution function. The goal of this article is to compute the first hitting time (position) of a word (pattern) in a semi-Markov sequence. To achieve this objective, we use the auxiliary prefix and backward chain. To give an example of the applications of the proposed model, the model is tested in a bacteriophage DNA sequence that is lacking the enzyme SmaI. We compute the probability that a word occurs for the first time after n nucleotides in a DNA sequence. The corresponding probability distribution, the mean waiting position, the variance, and rate of the occurrence of the word are obtained.

Keywords: DNA analysis; backward position process; first occurrence of a word; prefix process; semi-Markov chain.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Computational Biology / methods*
DNA / genetics*
Humans
Markov Chains*
Models, Statistical
Probability
Sequence Analysis, DNA / methods*

Substances

DNA