# zbMATH — the first resource for mathematics

Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source. (English) Zbl 1291.60148
Summary: We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem and are two-part algorithms. The first approach relies on a partial recursion after a fast computation of the second largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution.
We illustrate the interest of both approaches on a simple toy example and two biological applications: the transcription factors of the Human Chromosome 10 and the PROSITE signatures of functional motifs in proteins. On these examples our methods demonstrate their complementarity and their ability to extend the domain of feasibility for exact computations in pattern problems to a new level.

##### MSC:
 60J10 Markov chains (discrete-time Markov processes on discrete state spaces) 60J20 Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) 92D20 Protein sequences, DNA sequences 68Q25 Analysis of algorithms and problem complexity
RE2; PCRE; SPatt
Full Text:
##### References:
 [1] Aho, A.; Corasick, M., Efficient string matching: an aid to bibliographic search, Commun. ACM, 18, 6, 333-340, (1975) · Zbl 0301.68048 [2] Allauzen, C.; Mohri, M., A unified construction of the glushkov, follow, and antimirov automata, (Mathematical Foundations of Computer Science, (2006)), 110-121 · Zbl 1132.68434 [3] Antzoulakos, D. L., Waiting times for patterns in a sequence of multistate trials, J. Appl. Probab., 38, 508-518, (2001) · Zbl 0985.60072 [4] Beaudoing, E.; Freier, S.; Wyatt, J.; Claverie, J.-M.; Gautheret, D., Patterns of variant polyadenylation signal usage in human genes, Genome Res., 10, 7, 1001-1010, (2000) [5] Boeva, V.; Clement, J.; Régnier, M.; Roytberg, M.; Makeev, V., Exact $$p$$-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules, Algorithms Mol. Biol., 2, 13, (2007) [6] Boeva, V.; Clément, J.; Régnier, M.; Vandenbogaert, M., Assessing the significance of sets of words, (Combinatorial Pattern Matching 05, Lecture Notes in Computer Science, vol. 3537, (2005), Springer-Verlag), 358-370 · Zbl 1131.68490 [7] Brazma, A.; Jonassen, I.; Vilo, J.; Ukkonen, E., Predicting gene regulatory elements in silico on a genomic scale, Genome Res., 8, 11, 1202-1215, (1998) [8] Chang, Y.-M., Distribution of waiting time until the $$r$$th occurrence of a compound pattern, Statist. Probab. Lett., 75, 1, 29-38, (2005) · Zbl 1081.60060 [9] Cormen, T. H.; Leiserson, C. E.; Rivest, R. L., Introduction to algorithms, (1990), MIT Press, pp. 853-885 (Chapter 34) [10] Cowan, Expected frequencies of DNA patterns using whittle’s formula, J. Appl. Probab., 28, 886-892, (1991) · Zbl 0741.60071 [11] Crochemore, M.; Hancart, C., Automata for matching patterns, (Rozenberg, G.; Salomaa, A., Handbook of Formal Languages, Linear Modeling: Background and Application, vol. 2, (1997), Springer-Verlag Berlin), 399-462, (Chapter 9) [12] Crochemore, M.; Stefanov, V., Waiting time and complexity for matching patterns with automata, Inform. Process. Letters, 87, 3, 119-125, (2003) · Zbl 1161.68760 [13] Denise, A.; Régnier, M.; Vandenbogaert, M., Assessing the statistical significance of overrepresented oligonucleotides, Lect. Notes Comput. Sci., 2149, 85-97, (2001) · Zbl 1129.92303 [14] El Karoui, M.; Biaudet, V.; Schbath, S.; Gruss, A., Characteristics of chi distribution on different bacterial genomes, Res. Microbiol., 150, 579-587, (1999) [15] Erhardsson, T., Compound Poisson approximation for counts of rare patterns in Markov chains and extreme sojourns in birth-death chains, Ann. Appl. Probab., 10, 2, 573-591, (2000) · Zbl 1063.60007 [16] Fiduccia, C. M., An efficient formula for linear recurrences, SIAM J. Comput., 14, 1, 106-112, (1985) · Zbl 0561.68033 [17] Frith, M. C.; Spouge, J. L.; Hansen, U.; Weng, Z., Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences, Nucl. Acids. Res., 30, 14, 3214-3224, (2002) [18] Fu, J. C., Distribution theory of runs and patterns associated with a sequence of multi-state trials, Statist. Sinica, 6, 4, 957-974, (1996) · Zbl 0857.60068 [19] Geske, M. X.; Godbole, A. P.; Schaffner, A. A.; Skrolnick, A. M.; Wallstrom, G. L., Compound Poisson approximations for word patterns under Markovian hypotheses, J. Appl. Probab., 32, 877-892, (1995) · Zbl 0843.60025 [20] Godbole, A. P., Poisson approximations for runs and patterns of rare events, Adv. Appl. Probab., 23, (1991) · Zbl 0751.60018 [21] Hampson, S.; Kibler, D.; Baldi, P., Distribution patterns of over-represented k-mers in non-coding yeast DNA, Bioinformatics, 18, 4, 513-528, (2002) [22] Hopcroft, J., An $$n \log n$$ algorithm for minimizing states in a finite automaton, Reproduction, 189-196, (1971) [23] Hopcroft, J. E.; Motwani, R.; Ullman, J. D., Introduction to the automata theory, languages, and computation, (2001), ACM Press New York · Zbl 0980.68066 [24] Kaltofen, E.; Yang, Z., On exact and approximate interpolation of sparse rational functions, (Brown, C. W., Proceedings of the 2007 ACM International Symposium on Symbolic and Algebraic Computation, Waterloo, Canada, July 29-August 1, (2007), ACM Press New York), 203-210 · Zbl 1190.65019 [25] Karlin, S.; Burge, C.; Campbell, A., Statistical analyses of counts and distributions of restriction sites in DNA sequences, Nucl. Acids. Res., 20, 6, 1363-1370, (1992) [26] Kleffe, J.; Borodovski, M., First and second moments of counts of words in random texts generated by Markov chains, Bioinformatics, 8, 5, 433-441, (1997) [27] Knuth, D. E., (Seminumerical Algorithms, The Art of Computer Programming, vol. 2, (1997), Addison-Wesley Reading, MA, USA) [28] Le Maout, V., Regular expressions at their best: a case for rational design, Implementation and Application of Automata, 310-320, (2011) · Zbl 1297.68150 [29] Lladser, M. E., Mininal Markov chain embeddings of pattern problems, (Information Theory and Applications Workshop, (2007), IEEE), 251-255 [30] (Lothaire, M., Applied Combinatorics on Words, (2005), Cambridge University Press Cambridge) · Zbl 1133.68067 [31] Mariño-Ramírez, L.; Spouge, J. L.; Kanga, G. C.; Landsman, D., Statistical analysis of over-represented words in human promoter sequences, Nucl. Acids Res., 32, 3, 949-958, (2004) [32] Nicodème, P.; Salvy, B.; Flajolet, P., Motif statistics, Theoret. Comput. Sci., 287, 2, 593-617, (2002) · Zbl 1061.68118 [33] Nuel, G., LD-spatt: large deviations statistics for patterns on Markov chains, J. Comput. Biol., 11, 6, 1023-1033, (2004) [34] Nuel, G., Effective $$p$$-value computations using finite Markov chain imbedding (FMCI): application to local score and to pattern statistics, Algorithms Mol. Biol., 1, 1, 5, (2006) [35] Nuel, G., Numerical solutions for patterns statistics on Markov chains, Stat. Appl. Genet. Mol. Biol., 5, 1, 26, (2006) · Zbl 1166.62324 [36] Nuel, G., Pattern statistics on Markov chains and sensitivity to parameter estimation, Algorithms Mol. Biol., 1, 1, 17, (2006) [37] Nuel, G., Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata, J. Appl. Probab., 45, 1, 226-243, (2008) · Zbl 1142.65010 [38] Nuel, G., Significance score of motifs in biological sequences, (Mahdavi, D. M. A., Bioinformatics: Trends and Methodologies. Intech, (2011)), (Chapter 9). Available at http://www.intechopen.com/books/bioinformatics-trends-and-methodologies/significance-score-of-motifs-in-biological-sequences [39] Nuel, G.; Regad, L.; Martin, J.; Camproux, A.-C., Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data, Algorithms Mol. Biol., 5, 15, (2010) [40] Pevzner, P.; Borodovski, M.; Mironov, A., Linguistic of nucleotide sequences: the significance of deviation from mean statistical characteristics and prediction of frequencies of occurrence of words, J. Biomol. Struct. Dyn., 6, 1013-1026, (1989) [41] Prum, B.; Rodolphe, F.; de Turckheim, E., Finding words with unexpected frequencies in DNA sequences, J. R. Stat. Soc. Ser. B, 11, 190-192, (1995) [42] Régnier, M., A unified approach to word occurrences probabilities, Discrete Appl. Math., 104, 1, 259-280, (2000) · Zbl 0987.92017 [43] Reinert, G.; Schbath, S., Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains, J. Comput. Biol., 5, 223-254, (1999) [44] Ribeca, P.; Raineri, E., Faster exact Markovian probability functions for motif occurrences: a DFA-only approach, Bioinformatics, 24, 24, 2839-2848, (2008) [45] B. Salvy, Solutions rationnelles de systèmes linéaires à coefficients polynomiaux, in: Algorithmes en Calcul Formel et en Automatique, 2009 (Chapter 7). http://algo.inria.fr/chyzak/mpri/poly-20091201.pdf. [46] Stefanov, V.; Pakes, A. G., Explicit distributional results in pattern formation, Ann. Appl. Probab., 7, 666-678, (1997) · Zbl 0893.60005 [47] Stefanov, V. T.; Szpankowski, W., Waiting time distributions for pattern occurrence in a constrained sequence, Discrete Math. Theor. Comput. Sci., 9, 1, 305-320, (2007) · Zbl 1152.68475 [48] Storjohann, A., High-order lifting and integrality certification, J. Symbolic Comput., 36, 3-4, 613-648, (2003) · Zbl 1059.15020 [49] van Helden, J.; André, B.; Collado-Vides, J., Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, J. Mol. Biol., 281, 5, 827-842, (1998)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.