Distributions associated with general runs and patterns in hidden Markov models. (English) Zbl 1126.62008

Summary: This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), and thus, the theory includes as special cases results for a large class of problems that have wide applications.
The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples are given to illustrate the use of the methodology. Whereas the first application is more to illustrate the basic steps in applying the theory, the second is a more detailed application to DNA sequences, and shows that the methods can be adapted to include restrictions related to biological knowledge.


62E15 Exact distribution theory in statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
65C40 Numerical analysis or methods applied to Markov chains
65C60 Computational problems in statistics (MSC2010)
Full Text: DOI arXiv Euclid


[1] Aston, J. A. D. and Martin, D. E. K. (2005). Waiting time distributions of competing patterns in higher-order Markovian sequences. J. Appl. Probab. 42 977-988. · Zbl 1094.60005
[2] Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Appl. Statist. 39 357-365. · Zbl 0707.62186
[3] Balakrishnan, N. and Koutras, M. (2002). Runs and Scans with Applications . Wiley, New York. · Zbl 0991.62087
[4] Barlow, K. (2005). AL133339 locus of chromosome 20. EMBL/GenBank/DDBJ databases.
[5] Baum, L. E. and Eagon, J. A. (1967). An equality with applications to statistical estimation for probabilistic functions of finite state Markov chains. Bull. Amer. Math. Soc. 73 360-363. · Zbl 0157.11101
[6] Bird, A. (1987). CpG islands as gene markers in the vertebrate nucleus. Trends in Genetics 3 342-347.
[7] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models . Springer, New York. · Zbl 1080.62065
[8] Cheung, L. W. K. (2004). Use of runs statistics for pattern recognition in genomic DNA sequences. J. Comput. Biol. 11 107-124.
[9] Ching, W.-K., Ng, M. K. and Fung, E. (2003). Higher-order hidden Markov models with applications to DNA sequences. Lecture Notes in Comput. Sci. 2690 535-539. Springer, Berlin.
[10] Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51 79-94. · Zbl 0662.92012
[11] Cox, D. R. (1955). The analysis of non-Markovian stochastic processes by the inclusion of supplementary variables. Proc. Cambridge Philos. Soc. 51 433-441. · Zbl 0067.10902
[12] Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis . Cambridge Univ. Press. · Zbl 0929.92010
[13] Fu, J. C. and Chang, Y. M. (2003). On ordered series and later waiting time distributions in a sequence of Markov dependent multistate trials. J. Appl. Probab. 40 623-642. · Zbl 1046.60011
[14] Fu, J. C. and Koutras, M. V. (1994). Distribution theory of runs: A Markov chain approach. J. Amer. Statist. Assoc. 89 1050-1058. · Zbl 0806.60011
[15] Guéguen, L. (2005). Sarment: Python modules for HMM analysis and partitioning of sequences. Bioinformatics 21 3427-3428.
[16] Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57 357-384. · Zbl 0685.62092
[17] Harvey, A. C. (1989). Forecasting , Structural Time Series Models and the Kalman Filter . Cambridge Univ. Press.
[18] Hwang, G. U., Choi, B. D. and Kim, J.-K. (2002). The waiting time analysis of a discrete-time queue with arrivals as a discrete autoregressive process of order 1. J. Appl. Probab. 39 619-629. · Zbl 1016.60086
[19] Koski, T. (2001). Hidden Markov Models for Bioinformatics . Kluwer Academic Publishers, Dordrecht. · Zbl 0983.92001
[20] Koutras, M. V. and Alexandrou, V. A. (1995). Runs, scans and urn model distributions: A unified Markov chain approach. Ann. Inst. Statist. Math. 47 743-766. · Zbl 0848.60021
[21] Krogh, A. (1997). Two methods for improving performance of an HMM and their application for gene finding. In Fifth International Conference on Intelligent Systems for Molecular Biology (T. Gaasterland et al., eds.) 179-186. AAAI Press.
[22] Krogh, A., Mian, I. S. and Haussler, D. (1994). A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research 22 1501-1531.
[23] Li, J. and Gray, R. M. (2000). Image Segmentation and Compression Using Hidden Markov Models . Kluwer Academic Publishers. · Zbl 0984.68178
[24] Lindgren, G. (1978). Markov regime models for mixed distributions and switching regressions. Scand. J. Statist. 5 81-91. · Zbl 0382.62073
[25] Macdonald, I. L. and Zucchini, W. (1997). Hidden Markov and other Models for Discrete-valued Time Series . Chapman and Hall, London. · Zbl 0868.60036
[26] Martin, D. E. K. and Aston, J. A. D. (2007). Waiting time distribution of generalized later patterns. · Zbl 1452.62095
[27] Naus, J. I. (1982). Approximations for distributions of scan statistics. J. Amer. Statist. Assoc. 77 177-183. · Zbl 0482.62010
[28] Perkins, S. (1997). Inside Old Faithful: Scientists look down the throat of a geyser. Science News Oct 11 1997 232.
[29] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 257-286.
[30] Raftery, A. E. (1985). A model for high-order Markov chains. J. Roy. Statist. Soc. Ser. B 47 528-539. · Zbl 0593.62091
[31] Reinert, G., Schbath, S. and Waterman, M. S. (2005). Probabilistic and statistical properties of finite words in finite sequences. In Lothaire : Applied Combinatorics on Words (J. Berstel and D. Perrin, eds.). Cambridge Univ. Press. · Zbl 1133.68067
[32] Saxonov, S., Berg, P. and Brutlag, D. L. (2006). A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl. Acad. Sci. USA 103 1412-1417.
[33] Sims, C. A. and Zha, T. (2006). Were there regime switches in US monetary policy? American Economic Review 96 54-81.
[34] Smith, M. (2007). AL117335 locus of chromosome 20. EMBL/GenBank/DDBJ databases.
[35] Takai, D. and Jones, P. A. (2002). Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. USA 99 3740-3745.
[36] Takai, D. and Jones, P. A. (2003). The CpG island searcher: A new WWW resource. In Silico Biology 3 0021.
[37] Viterbi, A. (1967). Error bounds for convolutions codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory 13 260-269. · Zbl 0148.40501
[38] Yuan, M. and Kendziorski, C. (2006). Hidden Markov models for microarray time course data in multiple biological conditions. J. Amer. Statist. Assoc. 101 1323-1334. · Zbl 1171.62359
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.