×

Microscopic path structure of optimally aligned random sequences. (English) Zbl 1464.60010

This paper studies the microscopic path structure of optimally aligned random sequences, for an alignment \(\pi\) with gaps of two sequences \(x=x_1,\cdots,x_n\) and \(y=y_1,\cdots,y_n\) of equal length \(n\). The vector \(p_{\pi}(x,y)\) of all such ratios collected in lexicographical order is the empirical distribution vector of \(\pi\). Let \(SET(x,y)=\{p_{\pi}(x,y)\}_{\pi}\) be the set of all empirical distribution vectors related to \(x\) and \(y\). For two independent random sequences \(X=X_1,\ldots, X_n\) and \(Y=Y_1,\ldots,Y_n\) with i.i.d. components. The convex hull of \(SET(X,Y)\) is given by \(SET^n=\mathrm{conv}(SET(X,Y))\). Given a scoring function \(S\), let \(f_S\) be some associated linear functional. It is shown that \(SET^n\) almost surely converges to the set \(SET\) in the topology of the Hausdorff distance as \(n\) tends to infinity. Moreover, the empirical distribution of all optimal alignments of \(X\) and \(Y\) almost surely converge to a deterministic distribution when the scoring function \(S\) is selected such that \(f_S\) has a unique maximizer in \(SET\). For Lebesgue-almost every scoring function \(S\), it is proved that the empirical distributions of all optimal alignments almost surely converge to some deterministic distribution.

MSC:

60C05 Combinatorial probability
60F10 Large deviations
60K35 Interacting random processes; statistical mechanics type models; percolation theory
90C25 Convex programming
60F15 Strong limit theorems

References:

[1] Alexander, K.S. (1994). The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab. 4 1074-1082. · Zbl 0812.60014 · doi:10.1214/aoap/1177004903
[2] Amsalu, S., Hauser, R. and Matzinger, H. (2014). A Monte Carlo approach to the fluctuation problem in optimal alignments of random strings. Markov Process. Related Fields 20 107-144. · Zbl 1308.60105
[3] Baxevanis, A.D. and Ouellette, B.F.F. (2005). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. New York: Wiley.
[4] Bertsekas, D.P. and Yu, H. (2011). A unifying polyhedral approximation framework for convex optimization. SIAM J. Optim. 21 333-360. · Zbl 1218.90154 · doi:10.1137/090772204
[5] Borwein, J.M. and Lewis, A.S. (2000). Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC 3. New York: Springer. · Zbl 0953.90001
[6] Boutet de Monvel, J. (1999). Extensive simulations for longest common subsequences. Eur. Phys. J. B 7 293-308.
[7] Carathéodory, C. (1911). Über den Variabilitätsbereich der Fourierschen Konstanten von positiven harmonischen Funktionen. Rend. Circ. Mat. Palermo (2) 32 193-217. · JFM 42.0429.01
[8] Chvatal, V. and Sankoff, D. (1975). Longest common subsequences of two random sequences. J. Appl. Probab. 12 306-315. · Zbl 0313.60008 · doi:10.2307/3212444
[9] Durringer, C., Hauser, R. and Matzinger, H. (2008). Approximation to the mean curve in the LCS problem. Stochastic Process. Appl. 118 629-648. · Zbl 1140.60014 · doi:10.1016/j.spa.2007.05.010
[10] Fekete, M. (1923). Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten. Math. Z. 17 228-249. · JFM 49.0047.01 · doi:10.1007/BF01504345
[11] Gruber, P.M. (1993). History of convexity. In Handbook of Convex Geometry, Vol. A, B (P.M. Gruber, and J.M. Wills, eds.) 1-15. Amsterdam: North-Holland. · Zbl 0791.52001
[12] Hausdorff, F. (1949). Grundzüge der Mengenlehre. New York, NY: Chelsea Publishing Company. · Zbl 0041.02002
[13] Hauser, R., Martínez, S. and Matzinger, H. (2006). Large deviations-based upper bounds on the expected relative length of longest common subsequences. Adv. in Appl. Probab. 38 827-852. · Zbl 1101.60016 · doi:10.1239/aap/1158685004
[14] Hauser, R. and Matzinger, H. (2013). Letter change bias and local uniqueness in optimal sequence alignments. J. Stat. Phys. 153 512-529. · Zbl 1315.60011 · doi:10.1007/s10955-013-0819-4
[15] Hauser, R., Matzinger, H. and Popescu, I. (2018). An upper bound on the convergence rate of a second functional in optimal sequence alignment. Bernoulli 24 971-992. · Zbl 1429.60017 · doi:10.3150/16-BEJ823
[16] Hauser, R.A. and Matzinger, H. (2020). Supplement to “Microscopic path structure of optimally aligned random sequences.” https://doi.org/10.3150/18-BEJ1053SUPP. · Zbl 1464.60010
[17] Hirmo, E., Lember, J. and Matzinger, H. (2012). Detecting the homology of DNA-sequences based on the variety of optimal alignments: A case study. Available at arXiv:1210.3771.
[18] Howard, C.D. (2004). Probability on discrete structures. In Encyclopaedia of Mathematical Sciences (H. Kesten, ed.) 110. Berlin: Springer. · Zbl 1025.60001
[19] Jerrum, M. (2003). Counting, Sampling and Integrating: Algorithms and Complexity. Lectures in Mathematics ETH Zürich. Basel: Birkhäuser. · Zbl 1011.05001
[20] Karlin, S. and Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87 2264-2268. · Zbl 0695.92004 · doi:10.1073/pnas.87.6.2264
[21] Kingman, J.F.C. (1973). Subadditive ergodic theory. Ann. Probab. 1 883-909. · Zbl 0311.60018 · doi:10.1214/aop/1176996798
[22] Krein, M. and Milman, D. (1940). On extreme points of regular convex sets. Studia Math. 9 133-138. · Zbl 0063.03360 · doi:10.4064/sm-9-1-133-138
[23] Lember, J. and Matzinger, H. (2009). Standard deviation of the longest common subsequence. Ann. Probab. 37 1192-1235. · Zbl 1182.60004 · doi:10.1214/08-AOP436
[24] McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, 1989 (Norwich, 1989). London Mathematical Society Lecture Note Series 141 148-188. Cambridge: Cambridge Univ. Press. · Zbl 0712.05012
[25] Mil’man, D. (1947). Characteristics of extremal points of regularly convex sets. Dokl. Akad. Nauk SSSR 57 119-122. · Zbl 0029.14002
[26] Minkowski, H. (1968). Geometrie der Zahlen. Bibliotheca Mathematica Teubneriana, Band 40. New York: Johnson Reprint Corp. · Zbl 0958.11501
[27] Needleman, S.B. and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48 443-453.
[28] Rademacher, H. (1919). Über partielle und totale differenzierbarkeit von Funktionen mehrerer Variabeln und über die Transformation der Doppelintegrale. Math. Ann. 79 340-359. · JFM 47.0243.01
[29] Rockafellar, R.T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton, NJ: Princeton Univ. Press. · Zbl 0932.90001
[30] Steele, J.M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Ann. Statist. 14 753-758. · Zbl 0604.62017 · doi:10.1214/aos/1176349952
[31] Steele, J.M. (1989). Kingman’s subadditive ergodic theorem. Ann. Inst. Henri Poincaré Probab. Stat. 25 93-98. · Zbl 0669.60039
[32] Steele, J.M. (1997). Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics 69. Philadelphia, PA: SIAM. · Zbl 0916.90233
[33] Waterman, M. (1994). Estimating statistical significance of sequence alignments. Philos. Trans. R. Soc. Lond. B, Biol. Sci. 344 383-390.
[34] Waterman, M. (1995). Introduction to Computational Biology. London: Chapman & Hall. · Zbl 0831.92011
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.