×

Multiple pattern matching: a Markov chain approach. (English) Zbl 1147.65005

Summary: RNA motifs typically consist of short, modular patterns that include base pairs formed within and between modules. Estimating the abundance of these patterns is of fundamental importance for assessing the statistical significance of matches in genomewide searches, and for predicting whether a given function has evolved many times in different species or arose from a single common ancestor.
In this manuscript, we review in an integrated and self-contained manner some basic concepts of automata theory, generating functions and transfer matrix methods that are relevant to pattern analysis in biological sequences. We formalize, in a general framework, the concept of Markov chain embedding to analyze patterns in random strings produced by a memoryless source. This conceptualization, together with the capability of automata to recognize complicated patterns, allows a systematic analysis of problems related to the occurrence and frequency of patterns in random strings.
The applications we present focus on the concept of synchronization of automata, as well as automata used to search for a finite number of keywords (including sets of patterns generated according to base pairing rules) in a general text.

MSC:

65C40 Numerical analysis or methods applied to Markov chains
68Q45 Formal languages and automata
92D20 Protein sequences, DNA sequences
68R15 Combinatorics on words
60J22 Computational methods in Markov chains
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aho A.V. and Corasick M.J. (1975). Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6): 333–340 · Zbl 0301.68048 · doi:10.1145/360825.360855
[2] Aston J.A.D. and Martin D.E.K. (2005). Waiting time distributions of competing patterns in higher-order Markovian sequences. J. Appl. Prob. 42(4): 977–988 · Zbl 1094.60005 · doi:10.1239/jap/1134587810
[3] Biggins J.D. and Cannings C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Prob. 19: 521–545 · Zbl 0629.60100 · doi:10.2307/1427406
[4] Benson, G.: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 573–580 (1999)
[5] Bourdeau V., Ferbeyre G., Pageau M., Paquin B. and Cedergren R. (1999). The distribution of RNA motifs in natural sequences. Nucleic Acids Res. 27(22): 4457–4467 · doi:10.1093/nar/27.22.4457
[6] Bender E.A. and Kochman F. (1993). The distribution of subword counts is usually normal. Eur. J. Comb. 14(4): 265–275 · Zbl 0776.68097 · doi:10.1006/eujc.1993.1030
[7] Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: RECOMB ’03: Proceedings of the seventh annual international conference on Research in computational molecular biology, pp. 67–75 (2003)
[8] Bussemaker H.J., Li H. and Siggia E.D. (2000). Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA 97(18): 10096–10100 · doi:10.1073/pnas.180265397
[9] Brémaud P. (1998). Markov Chains: Gibbs fields, Monte Carlo Simulation and Queues. Springer, Heidelberg · Zbl 0949.60009
[10] Bourdon, J., Vallée, B.: Generalized pattern matching statistics. In: Colloquium on Mathematics and Computer Science: Algorithms and Trees, Trends in Mathematics, pp. 249–265. Birkhauser, 2002 · Zbl 1034.68024
[11] Bourdon, J., Vallée, B.: Pattern matching statistics on correlated sources. In: Proceedings of the seventh Latin American Symposium on Theoretical Informatics (LATIN’06), pp. 224–237, Valdivia, Chile (2006) · Zbl 1145.68478
[12] Breen S., Waterman M.S. and Zhang N. (1985). Renewal theory for several patterns. J. Appl. Prob. 22: 228–234 · Zbl 0564.60081 · doi:10.2307/3213763
[13] Clément J., Flajolet P. and Vallée B. (2001). Dynamical sources in information theory: a general analysis of trie structures. Algorithmica 29(1): 307–369 · Zbl 1035.68039 · doi:10.1007/BF02679623
[14] Chen, X.: Limit theorems for functional of ergodic Markov chains with general state space, vol. 139. Memoirs of the American Mathematical Society, 1999 · Zbl 0952.60014
[15] Crochemore M. and Rytter W. (2002). Jewels of Stringology. World Scientific, Singapore · Zbl 1078.68151
[16] Cech T.R., Zaug A.J. and Grabowski P.J. (1981). In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell 27(3 Pt 2): 487–496 · doi:10.1016/0092-8674(81)90390-1
[17] Durrett R. (1999). Essentials of Stochastic Processes. Springer, Heidelberg · Zbl 0940.60004
[18] Durrett R. (2004). Probability: Theory and Examples, third edition. Duxbury Press, North Scituate
[19] Eddy S.R. and Durbin R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res. 22(11): 2079–2088 · doi:10.1093/nar/22.11.2079
[20] Ferbeyre G., Bourdeau V., Pageau M., Miramontes P. and Cedergren R. (2000). Distribution of hammerhead and hammerhead-like RNA motifs through the GenBank. Genome Res. 10(7): 1011–1019 · doi:10.1101/gr.10.7.1011
[21] Fu J.C. and Chang Y.M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. J. Appl. Prob. 39(1): 70–80 · Zbl 1008.60031 · doi:10.1239/jap/1019737988
[22] Fu J.C. and Chang Y.M. (2003). On ordered series and later waiting time distributions in a sequence of Markov dependent multistate trials. J. Appl. Prob. 40(3): 623–642 · Zbl 1046.60011 · doi:10.1239/jap/1059060892
[23] Feller W. (1968). An Introduction to Probability Theory and Its Applications third edition. Wiley, New York · Zbl 0155.23101
[24] Felsenstein J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6): 368–376 · doi:10.1007/BF01734359
[25] Fu J.C. and Koutras M.V. (1994). Distribution theory of runs: a Markov chain approach. J. Am. Statist. Assoc. 89(427): 1050–1058 · Zbl 0806.60011 · doi:10.2307/2290933
[26] Flajolet P., Kirschenhofer P. and Tichy R.F. (1988). Deviations from uniformity in random strings. Probab. Th. Rel. Fields 80(1): 139–150 · Zbl 0638.68058 · doi:10.1007/BF00348756
[27] Fu, J.C., Lou, W.Y.W.: Distribution theory of runs and patterns and its applications. A finite Markov chain imbedding approach. World Scientific, Singapor (2003) · Zbl 1030.60063
[28] Flajolet, P., Sedgewick, R.: Analytic Combinatorics, 2006. Electronic version available online at http://algo.inria.fr/flajolet/Publications/book060418.pdf · Zbl 1165.05001
[29] Flajolet P., Szpankowski W. and Vallée B. (2006). Hidden word statistics. J. ACM 53(1): 147–183 · Zbl 1316.68111 · doi:10.1145/1120582.1120586
[30] Gani J. and Irle A. (1999). On patterns in sequences of random events. Mh. Math. 127: 295–309 · Zbl 0923.60087 · doi:10.1007/s006050050041
[31] Goulden I.P. and Jackson D.M. (2004). Combinatorial Enumeration. Dover, New York · Zbl 1099.05005
[32] Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S.R. and Bateman A. (2005). RFAM: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33(Database issue): 121–124
[33] Gerber H.U. and Li S.-Y.R. (1981). The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stoch. Process. Appl. 11(1): 101–108 · Zbl 0449.60050 · doi:10.1016/0304-4149(81)90025-9
[34] Guibas L.J. and Odlyzko A.M. (1978). Maximal prefix-synchronized codes. SIAM J. Appl. Math. 35(2): 401–418 · Zbl 0394.94024 · doi:10.1137/0135034
[35] Guibas L.J. and Odlyzko A.M. (1981). Periods in strings. J. Combin. Theory Ser. A 30(1): 19–42 · Zbl 0464.68070 · doi:10.1016/0097-3165(81)90038-8
[36] Guibas L.J. and Odlyzko A.M. (1981). String overlaps, pattern matching and nontransitive games. J. Comb. Theory Ser. A 30(2): 183–208 · Zbl 0454.68109 · doi:10.1016/0097-3165(81)90005-4
[37] Guerrier-Takada C., Gardiner K., Marsh T., Pace N. and Altman S. (1983). The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35(3 Pt 2): 849–857 · doi:10.1016/0092-8674(83)90117-4
[38] Hentze M.W., Caughman S.W., Rouault T.A., Barriocanal J.G., Dancis A., Harford J.B. and Klausner R.D. (1987). Identification of the iron-responsive element for the translational regulation of human ferritin mRNA. Science 238(4833): 1570–1573 · doi:10.1126/science.3685996
[39] Han Q. and Hirano K. (2003). Sooner and later waiting time problems for patterns in Markov dependent trials. J. Appl. Prob. 40(1): 73–86 · Zbl 1023.60017 · doi:10.1239/jap/1044476828
[40] Hopcroft J.E. and Ullman J.D. (1979). Introduction to automata theory, languages and computation. Addison-Wesley, Reading · Zbl 0426.68001
[41] Jones G.L. (2004). On the Markov chain central limit theorem. Probab. Surv. 1: 299–320 · Zbl 1189.60129 · doi:10.1214/154957804100000051
[42] Knight R., De Sterck H., Markel R., Smit S., Oshmyansky A. and Yarus M. (2005). Abundance of correctly folded RNA motifs in sequence space, calculated on computational grids. Nucleic Acids Res. 33(18): 5924–5935 · doi:10.1093/nar/gki886
[43] Klein R.J. and Eddy S.R. (2003). RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinform. 4: 44 · Zbl 05325884 · doi:10.1186/1471-2105-4-44
[44] Kimura M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78(1): 454–458 · Zbl 0511.92013 · doi:10.1073/pnas.78.1.454
[45] Knuth D.E., Pratt V.R. and Morris J.H. (1977). Fast pattern matching in strings. SIAM J. Comput. 6(2): 323–350 · Zbl 0372.68005 · doi:10.1137/0206024
[46] Kucherov, G., Noe, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds (extended abstract), (2006)
[47] Knight R. and Yarus M. (2003). Finding specific RNA motifs: function in a zeptomole world?. RNA 9(2): 218–230 · doi:10.1261/rna.2138803
[48] Lewis, B.P., Burge, C.B., Bartel, D.P.: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1), 15–20, Jan 2005. Letter
[49] Li S.-Y.R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Probab. 8(6): 1171–1176 · Zbl 0447.60006 · doi:10.1214/aop/1176994578
[50] Lladser, M.: Minimal markov chain embeddings of pattern problems. In: Proceedings of the 2007 Information Theory and Applications Workshop, University of California, San Diego (2007) · Zbl 1192.68971
[51] Lothaire M., Rota G.-C., Doran B., Ismail M., Lam T.Y., Wutwak E., Flajolet P. and Lutwak E. (2005). Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications). Cambridge University Press, Cambridge
[52] Lu C., Tej S.S., Luo S., Haudenschild C.D., Meyers B.C. and Green P.J. (2005). Elucidation of the small RNA component of the transcriptome. Science 309(5740): 1567–1569 · doi:10.1126/science.1114112
[53] Martin D. (2005). Distribution of the number of successes in success runs of length at least k in higher-order Markovian sequences. Methodol. Comput. Appl. Probab. 7(4): 543–554 · Zbl 1102.60011 · doi:10.1007/s11009-005-5007-9
[54] Nicodème P. (2003). Regexpcount, a symbolic package for counting problems on regular expressions and words. Fundamenta Informaticae 56(1-2): 71–88 · Zbl 1051.68154
[55] Nicodème P., Salvy B. and Flajolet P. (2002). Motif statistics. Theoret. Comput. Sci. 287(2): 593–617 · Zbl 1061.68118 · doi:10.1016/S0304-3975(01)00264-X
[56] Pozdnyakov V.I. and Kulldorff M. (2006). Waiting times for patterns and a method of gambling teams. Am. Math. Month. 113(2): 134–143 · Zbl 1143.60010 · doi:10.2307/27641864
[57] Park Y. and Spouge J.L. (2004). Searching for multiple words in a Markov sequence. INFORMS J. Comput. 16(4): 341–347 · Zbl 1239.60074 · doi:10.1287/ijoc.1040.0095
[58] Robin S.S. and Daudin J.J. (2001). Exact distribution of the distances between any occurrences of a set of words. Ann. Inst. Statist. Math. 53(4): 895–905 · Zbl 1006.60012 · doi:10.1023/A:1014633825822
[59] Régnier M. and Denise A. (2004). Rare events and conditional events on random strings. DMTCS 6(2): 191–214 · Zbl 1059.05008
[60] Rivas E. and Eddy S.R. (2000). The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics 16(4): 334–340 · doi:10.1093/bioinformatics/16.4.334
[61] Régnier M. (2000). A unified approach to word occurrences probabilities. Discrete Appl. Math. 104(1): 259–280 · Zbl 0987.92017 · doi:10.1016/S0166-218X(00)00195-5
[62] Régnier, M., Lifanov, A., Makeev, V.: Three variations on word counting. In: Proceedings German Conference on Bioinformatics, GCB’00, Heidelberg, pp. 75–82. Logos-Verlag, 2000
[63] Robin S., Rodolphe F. and Schbath S. (2005). DNA, Words and Models. Cambridge University Press, New York · Zbl 1185.92047
[64] Régnier M. and Szpankowski W. (1998). On pattern frequency occurrences in a Markovian sequence. Algorithmica 22(4): 631–649 · Zbl 0918.68108 · doi:10.1007/PL00009244
[65] Salehi-Ashtiani K. and Szostak J.W. (2001). In vitro evolution suggests multiple origins for the hammerhead ribozyme. Nature 414(6859): 82–84 · doi:10.1038/35102081
[66] Shao J. (2003). Mathematical Statistics, second edition. Springer, Heidelberg · Zbl 1018.62001
[67] Sipser, M.: Introduction to the Theory of Computation. International Thomson Publishing, (1996) · Zbl 1169.68300
[68] Singh R., Robida M.D. and Karimpour S. (2006). Building biological complexity with limited genes. Curr. Genom. 7: 97–114 · doi:10.2174/138920206777304669
[69] Sabeti P.C., Unrau P.J. and Bartel D.P. (1997). Accessing rare activities from random RNA sequences: the importance of the length of molecules in the starting pool. Chem. Biol. 4(10): 767–774 · doi:10.1016/S1074-5521(97)90315-X
[70] Tang J. and Breaker R.R. (2000). Structural diversity of self-cleaving ribozymes. Proc. Natl. Acad. Sci. USA 97(11): 5784–5789 · doi:10.1073/pnas.97.11.5784
[71] Vallée B. (2001). Dynamical sources in information theory: fundamental intervals and word prefixes. Algorithmica 29(1): 262–306 · Zbl 1009.94003 · doi:10.1007/BF02679622
[72] Waterman, M.S.: Introduction to computational biology: maps, sequences and genomes. Chapman & Hall, WAT m 95:1 1.Ex (1995) · Zbl 0831.92011
[73] Wilf H.S. (1994). Generatingfunctiology, second edition. Academic, New York
[74] Welch M., Majerfeld I. and Yarus M. (1997). 23S rRNA similarity from selection for peptidyl transferase mimicry. Biochemistry 36(22): 6614–6623 · doi:10.1021/bi963135j
[75] Winkler W., Nahvi A. and Breaker R.R. (2002). Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression. Nature 419(6910): 952–956 · doi:10.1038/nature01145
[76] Yarus M., Caporaso J.G. and Knight R. (2005). Origins of the genetic code: the escaped triplet theory. Annu. Rev. Biochem. 74: 179–198 · doi:10.1146/annurev.biochem.74.082803.133119
[77] Yarus M. and Welch M. (2000). Peptidyl transferase: ancient and exiguous. Chem. Biol. 7(10): 187–190 · doi:10.1016/S1074-5521(00)00027-2
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.