×

The rate of the convergence of the mean score in random sequence comparison. (English) Zbl 1244.60095

Summary: We consider a general class of superadditive scores measuring the similarity of two independent sequences of \(n\) i.i.d. letters from a finite alphabet. Our object of interest is the mean score by letter \(l_{n}\). By subadditivity \(l_{n}\) is nondecreasing and converges to a limit \(l\). We give a simple method of bounding the difference \(l - l_{n}\) and obtaining the rate of convergence. Our result generalizes the previous result of K. S. Alexander [Ann. Appl. Probab. 4, No. 4, 1074–1082 (1994; Zbl 0812.60014)], where only the special case of the longest common subsequence was considered.

MSC:

60K35 Interacting random processes; statistical mechanics type models; percolation theory
41A25 Rate of convergence, degree of approximation
60C05 Combinatorial probability

Citations:

Zbl 0812.60014
PDFBibTeX XMLCite
Full Text: DOI arXiv Euclid

References:

[1] Alexander, K. S. (1994). The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab. 4 1074-1082. · Zbl 0812.60014 · doi:10.1214/aoap/1177004903
[2] Alexander, K. S. (1997). Approximation of subadditive functions and convergence rates in limiting-shape results. Ann. Probab. 25 30-55. · Zbl 0882.60090 · doi:10.1214/aop/1024404277
[3] Arratia, R. and Waterman, M. S. (1994). A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 4 200-225. · Zbl 0809.62008 · doi:10.1214/aoap/1177005208
[4] Baeza-Yates, R. A., Gavaldà, R., Navarro, G. and Scheihing, R. (1999). Bounding the expected length of longest common subsequences and forests. Theory Comput. Syst. 32 435-452. · Zbl 0934.68043 · doi:10.1007/s002240000125
[5] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413. · Zbl 0946.62036 · doi:10.1007/s004400050210
[6] Booth, H. S., MacNamara, S. F., Nielsen, O. M. and Wilson, S. R. (2004). An iterative approach to determining the length of the longest common subsequence of two strings. Methodol. Comput. Appl. Probab. 6 401-421. · Zbl 1056.92044 · doi:10.1023/B:MCAP.0000045088.88240.3a
[7] Chvatal, V. and Sankoff, D. (1975). Longest common subsequences of two random sequences. J. Appl. Probab. 12 306-315. · Zbl 0313.60008 · doi:10.2307/3212444
[8] Cristianini, N. and Hahn, M. W. (2007). Introduction to Computational Genomics . Cambridge Univ. Press, Cambridge. · Zbl 1321.92005
[9] Dancik, V. (1994). Expected length of longest common subsequences. Ph.D. dissertation, Dept. Computer Science, Univ. Warwick.
[10] Dančík, V. and Paterson, M. (1995). Upper bounds for the expected length of a longest common subsequence of two binary sequences. Random Structures Algorithms 6 449-458. · Zbl 0839.68077 · doi:10.1002/rsa.3240060408
[11] Deken, J. G. (1979). Some limit results for longest common subsequences. Discrete Math. 26 17-31. · Zbl 0403.68064 · doi:10.1016/0012-365X(79)90057-8
[12] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics ( New York ) 31 . Springer, New York. · Zbl 0853.68150
[13] Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids . Cambridge Univ. Press, Cambridge. · Zbl 0929.92010
[14] Durringer, C., Hauser, R. and Matzinger, H. (2008). Approximation to the mean curve in the LCS problem. Stochastic Process. Appl. 118 629-648. · Zbl 1140.60014 · doi:10.1016/j.spa.2007.05.010
[15] Fu, J. C. and Lou, W. Y. W. (2008). Distribution of the length of the longest common subsequence of two multi-state biological sequences. J. Statist. Plann. Inference 138 3605-3615. · Zbl 1147.92010 · doi:10.1016/j.jspi.2007.03.063
[16] Kiwi, M., Loebl, M. and Matoušek, J. (2005). Expected length of the longest common subsequence for large alphabets. Adv. Math. 197 480-498. · Zbl 1087.68081 · doi:10.1016/j.aim.2004.10.012
[17] Lin, C. Y. and Och, F. J. (2004). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL’ 04: Proceedings of the 42 nd Annual Meeting on Association for Computational Linguistics 605. Association for Computational Linguistics, Stroudsburg, PA.
[18] Lueker, G. S. (2009). Improved bounds on the average length of longest common subsequences. J. ACM 56 Art. 17, 38. · Zbl 1094.68599 · doi:10.1145/1516512.1516519
[19] Melamed, I. D. (1995). Automatic evaluation and uniform filter cascades for inducing n -best translation lexicons. In Proceedings of the Third Workshop on Very Large Corpora . MIT, Boston.
[20] Melamed, I. D. (1999). Bitext maps and alignment via pattern recognition. Comput. Linguist. 107-130.
[21] Paterson, M. and Dančík, V. (1994). Longest common subsequences. In Mathematical Foundations of Computer Science 1994 ( KoŠice , 1994). Lecture Notes in Computer Science 841 127-142. Springer, Berlin. · doi:10.1007/3-540-58338-6_63
[22] Pevzner, P. A. (2000). Computational Molecular Biology : An Algorithmic Approach , a Bradford Book . MIT Press, Cambridge, MA. · Zbl 0972.92011
[23] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Bio. 147 195-197.
[24] Waterman, M. S. (1995). Introduction to Computational Biology . Chapman & Hall, New York. · Zbl 0831.92011
[25] Waterman, M. S. and Vingron, M. (1994). Sequence comparison significance and Poisson approximation. Statist. Sci. 9 367-381. · Zbl 0955.92501 · doi:10.1214/ss/1177010382
[26] Yang, C. C. and Li, K. W. (2003). Automatic construction of english/chinese parallel corpora. Journal of the American Society for Information Science and Technology 54 730-742.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.