Identifying a word (pattern) in a long sequence of letters is not an easy task. To achieve this objective, several models have been proposed under the assumption that the sequence of letters is described by a Markov chain. The Markovian hypothesis imposes restrictions on the distribution of the sojourn time in a state, which has geometric distribution in a discrete process. This is the main drawback when applying Markov chains to real problems. By contrast, semi-Markov processes are generalized. In semi-Markov processes, the sojourn time in a state can be governed by any distribution function. The goal of this article is to compute the first hitting time (position) of a word (pattern) in a semi-Markov sequence. To achieve this objective, we use the auxiliary prefix and backward chain. To give an example of the applications of the proposed model, the model is tested in a bacteriophage DNA sequence that is lacking the enzyme SmaI. We compute the probability that a word occurs for the first time after n nucleotides in a DNA sequence. The corresponding probability distribution, the mean waiting position, the variance, and rate of the occurrence of the word are obtained.
Keywords: DNA analysis; backward position process; first occurrence of a word; prefix process; semi-Markov chain.