zbMATH — the first resource for mathematics

A sum-over-paths extension of edit distances accounting for all sequence alignments. (English) Zbl 1209.68454
Summary: This paper introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences, and extends related previous work from bioinformatics to the case of graphs with cycles. Each alignment \(\wp\), with a total cost \(C(\wp)\), is assigned a probability of occurrence \(P(\wp)=\exp[-\theta C(\wp)]/\mathcal Z\) where \(\mathcal Z\) is a normalization factor. Therefore, good alignments (having a low cost) are favored over bad alignments (having a high cost). The expected cost \(\sum_{\wp \in \mathcal P}C(\wp)\exp[-\theta C(\wp)]/\mathcal Z\) computed over all possible alignments \(\wp \in \mathcal P\) defines the SoP edit distance. When \(\theta \rightarrow \infty \), only the best alignments matter and the measure reduces to the standard edit distance. The rationale behind this definition is the following: for some applications, two sequences sharing many good alignments should be considered as more similar than two sequences having only one single good, optimal, alignment in common. In other words, sub-optimal alignments could also be taken into account. Forward/backward recurrences allowing to efficiently compute the expected cost are developed. Virtually any Viterbi-like sequence comparison algorithm computed on a lattice can be generalized in the same way; for instance, a SoP longest common subsequence is also developed. Pattern classification tasks performed on five data sets show that the new measures usually outperform the standard ones and, in any case, never perform significantly worse, at the expense of tuning the parameter \(\theta \).

68T10 Pattern recognition, speech recognition
68R10 Graph theory (including graph drawing) in computer science
68W32 Algorithms on strings
Full Text: DOI
[1] Y. Achbany, F. Fouss, L. Yen, A. Pirotte, M. Saerens, Optimal tuning of continual exploration in reinforcement learning, in: Proceedings of the 16th International Conference on Artificial Neural Networks (ICANN 06), Lecture Notes in Computer Science, vol. 4131, 2006, pp. 734-749.
[2] Achbany, Y.; Fouss, F.; Yen, L.; Pirotte, A.; Saerens, M., Tuning continual exploration in reinforcement learning: an optimality property of the Boltzmann strategy, Neurocomputing, 71, 2507-2520, (2008)
[3] Akamatsu, T., Cyclic flows, Markov process and stochastic traffic assignment, Transportation research B, 30, 5, 369-386, (1996)
[4] Juan-Carlos Amengual and Enrique Vidal. On the estimation of error-correcting parameters, in: Proceedings of the International Conference on Pattern Recognition (ICPR), 2000, pp. 2883-2886.
[5] Arlazarov, V.L.; Dinic, E.A.; Kronod, M.A.; Faradzev, I.A., On economical construction of the transitive closure of an oriented graph, Doklady akademii nauk SSSR, 194, 487-488, (1970)
[6] Bahl, L.R.; Jelinek, F., Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition, IEEE transactions on information theory, 21, 4, 404-411, (1975) · Zbl 0309.94037
[7] F. Beuvens, T. Dullier, Développement et expérimentation d’une plate-forme de reconnaissance de gestes par stylet, Master’s Thesis, Université Catholique de Louvain, 2009.
[8] Borg, I.; Groenen, P., Modern multidimensional scaling: theory and applications, (1997), Springer New York · Zbl 0862.62052
[9] Boyd, S.; Diaconis, P.; Xiao, L., Fastest mixing Markov chain on a graph, SIAM review, 667-689, (2004) · Zbl 1063.60102
[10] P. Bucher, K. Hofmann, A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system, in: Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology (ISMB 1996), AAAI, 1996.
[11] Cancedda, N.; Gaussier, E.; Goutte, C.; Renders, J.M., Word sequence kernels, Journal of machine learning research, 3, 1059-1082, (2003) · Zbl 1061.68563
[12] Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C., Introduction to algorithms, (2000), MIT Press, McGraw-Hill Book Company Cambridge, Massachusetts
[13] Cover, T.M.; Thomas, J.A., Elements of information theory, (1991), John Wiley & Sons New Jersey · Zbl 0762.94001
[14] Cox, T.; Cox, M., Multidimensional scaling, (2001), Chapman and Hall Boca Raton, Floride · Zbl 1004.91067
[15] J.-C. Delvenne, A.-S. Libert, Centrality measures and thermodynamic formalism for complex networks, Manuscript, submitted for publication.
[16] C.B. Do, S.S. Gross, S. Batzoglou, CONTRAlign: discriminative training for protein sequence alignment, in: Proceedings of the 10th Annual International Conference on Computational Molecular Biology (RECOMB 2006), 2006. · Zbl 1302.92098
[17] Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G., Biological sequence analysis: probabilistic models of proteins and nucleic acids, (1998), Cambridge University Press Cambridge · Zbl 0929.92010
[18] Ekroot, L.; Cover, T., The entropy of Markov trajectories, IEEE transactions on information theory, 39, 4, 1418-1421, (1993) · Zbl 0802.60067
[19] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J., LIBLINEAR: a library for large linear classification, Journal of machine learning research, 9, 1871-1874, (2008) · Zbl 1225.68175
[20] Fornay, G.D., The viterbi algorithm, Proceedings of the IEEE, 61, 3, 268-278, (1973)
[21] Girardin, V., Entropy minimization for Markov and semi-Markov processes, Methodology and computing in applied probability, 6, 109-127, (2004) · Zbl 1043.60080
[22] Girardin, V.; Limnios, N., Entropy rate and maximum entropy methods for countable semi-Markov chains, Communications in statistics, 33, 3, 609-622, (2004) · Zbl 1114.60069
[23] Gusfield, D., Algorithms on strings, trees, and sequences, (1997), Cambridge University Press Cambridge · Zbl 0934.68103
[24] Huang, X.; Ariki, Y.; Jack, M., Hidden Markov models for speech recognition, (1990), Columbia University Press New York, NY, USA
[25] Hwa, T.; Lässig, M., Similarity detection and localization, Physical review letters, 76, 2591, (1996)
[26] Jaakkola, T.; Diekhans, M.; Haussler, D., A discriminative framework for detecting remote protein homologies, Journal of computational biology, 7, 1-2, 95-114, (2000)
[27] Jurafsky, D.; Martin, J.H., Speech and language processing, (2008), Prentice Hall Upper Saddle River, New Jersey
[28] Krogh, A.; Brown, M.; Mian, I.S.; Sjölander, K.; Haussler, D., Hidden Markov models in computational biology: applications to protein modeling, Journal of molecular biology, 235, 5, 1501-1531, (1994)
[29] Kruskal, J.B., An overview of sequence comparison: time warps, string edits, and macromolecules, SIAM review, 25, 201-237, (1983) · Zbl 0512.68048
[30] Kschischo, M.; Lässig, M., Finite-temperature sequence alignment, Pacific symposium on biocomputing, 5, 624-635, (2000)
[31] Leslie, C.S.; Eskin, E.; Cohen, A.; Weston, J.; Stafford Noble, W., Mismatch string kernels for discriminative protein classification, Bioinformatics, 20, 4, 467-476, (2004)
[32] C.S. Leslie, E. Eskin, W.S. Noble, The spectrum kernel: a string kernel for SVM protein classification, in: Pacific Symposium on Biocomputing, 2002, pp. 566-575.
[33] Liao, L.; Stafford Noble, W., Combining pairwise sequence similarity and support vector machines for remote protein homology detection, (), 225-232
[34] M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Proceedings of the 11th International Conference on Machine Learning (ICML-94), 1994, pp. 157-163.
[35] Lodhi, H.; Saunders, C.; Shawe-Taylor, J.; Cristianini, N.; Watkins, C.; Scholkopf, B., Text classification using string kernels, Journal of machine learning research, 2, 563-569, (2002)
[36] Mantrach, A.; Yen, L.; Callut, J.; Françoisse, K.; Shimbo, M.; Saerens, M., The sum-over-paths covariance kernel: a novel covariance measure between nodes of a directed graph, IEEE transactions pattern analysis and machine intelligence, 32, 6, 1112-1126, (2010)
[37] Marzal, A.; Vidal, E., Computation of normalized edit distance and applications, IEEE transactions on pattern analysis and machine intelligence, 15, 9, 926-932, (1993)
[38] Miyazawa, S., A reliable sequence alignment method based on probabilities of residue correspondences, Protein engineering, 8, 999-1009, (1995)
[39] Navarro, G., A guided tour to approximate string matching, ACM computing surveys, 33, 1, 31-88, (2001)
[40] Newberg, L.A.; Lawrence, C.E., Exact calculation of distributions on integers, with application to sequence alignment, Journal of computational biology: A journal of computational molecular cell biology, 16, 1, 1-18, (2009)
[41] Oncina, J.; Sebban, M., Learning stochastic edit distance: application in handwritten character recognition, Pattern recognition, 39, 9, 1575-1587, (2006) · Zbl 1096.68724
[42] Osborne, M.J., An introduction to game theory, (2004), Oxford University Press Oxford
[43] Rabiner, L.; Hwang Juang, B., Fundamentals of speech recognition, (April 1993), Prentice Hall PTR Englewood Cliffs, New Jersey
[44] Rabiner, L., Readings in speech recognition. chapter A tutorial on hidden Markov models and selected applications in speech recognition, (1990), Morgan Kaufmann Publishers Inc., pp. 267-296
[45] E. Ricci, T. De Bie, N. Cristianini, Learning to align: a statistical approach, in: Proceedings of the 7th International Symposium on Intelligent Data Analysis (IDA 2007), Ljubljana, 2007.
[46] Ristad, E.; Yianilos, P., Learning string-edit distance, IEEE transactions on pattern analysis and machine intelligence, 20, 5, 522-532, (1998)
[47] Sven Ristad, E.; Yianilos, P.N., Learning string-edit distance, IEEE transactions on pattern analysis and machine intelligence, 20, 5, 522-532, (1998)
[48] Rousu, J.; Shawe-Taylor, J., Efficient computation of gapped substring kernels on large alphabets, Journal of machine learning research, 6, 1323-1344, (2005) · Zbl 1222.68292
[49] Saerens, M.; Achbany, Y.; Fouss, F.; Yen, L., Randomized shortest-path problems: two related models, Neural computation, 21, 8, 2363-2404, (2009) · Zbl 1192.90031
[50] Saigo, H.; Vert, J.-P.; Ueda, N.; Akutsu, T., Protein homology detection using string alignment kernels, Bioinformatics, 20, 11, 1682-1689, (2004)
[51] Sankoff, D.; Kruskal, J.B., Time warps, string edits, and macromolecules, (1983), Addison-Wesley Stanford, California · Zbl 0512.68048
[52] Shawe-Taylor, J.; Cristianini, N., Kernel methods for pattern analysis, (2004), Cambridge University Press Cambridge
[53] Steel, M.; Hein, J., Applying the thorne – kishino – felsenstein model to sequence evolution on a star-shaped tree, Applied mathematics letters, 14, 679-684, (2001) · Zbl 0999.92028
[54] Stephen, G., String searching algorithms, (1994), World Scientific Singapore · Zbl 0831.68028
[55] Sun, J.; Boyd, S.; Xiao, L.; Diaconis, P., The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem, SIAM review, 681-699, (2006) · Zbl 1109.60324
[56] A. Tahbaz, A. Jadbabaie, A one-parameter family of distributed consensus algorithms with boundary: from shortest paths to mean hitting times, in: Proceedings of IEEE Conference on Decision and Control, 2006, pp. 4664-4669.
[57] Theodoridis, S.; Koutroumbas, K., Pattern recognition, (2006), Academic Press · Zbl 1093.68103
[58] Thorne, J.L.; Kishino, H.; Felsenstein, J., An evolutionary model for maximum likelihood alignment of DNA sequences, Journal of molecular evolution, 33, 114-124, (1991)
[59] Thorne, J.L.; Kishino, H.; Felsenstein, J., Inching toward reality: an improved likelihood model of sequence evolution, Journal of molecular evolution, 34, 3-16, (1992)
[60] Todorov, E., Linearly-solvable Markov decision problems, (), 1369-1375
[61] J. Tomlin, A new paradigm for ranking pages on the world wide web, in: Proceedings of the International World Wide Web Conference (WWW2003), 2003, pp. 350-355.
[62] Vidal, E.; Marzal, A.; Aibar, P., Fast computation of normalized edit distances, IEEE transactions on pattern analysis and machine intelligence, 17, 9, 899-902, (1995)
[63] Vishwanathan, S.V.N.; Borgwardt, K.M.; Kondor, R.I.; Schraudolph, N.N., Graph kernels, Journal of machine learning research, 11, 1201-1242, (2010) · Zbl 1242.05112
[64] Vishwanathan, S.V.N.; Borgwardt, K.M.; Schraudolph, N.N., Fast computation of graph kernels, ()
[65] Vishwanathan, S.V.N.; Smola, A.J., Fast kernels for string and tree matching, (), 569-576
[66] Wagner, R.A.; Fischer, M.J., The string-to-string correction problem, Journal of the ACM, 21, 1, 168-173, (1974) · Zbl 0278.68032
[67] Waterman, M.S.; Eggert, M., A new algorithm for best subsequence alignments with application to trna-rrna comparisons, Journal of molecular biology, 197, 4, 723-728, (1987)
[68] C. Watkins, Kernels from matching operations, Technical Report, Department of Computer Science, Royal Holloway, University of London, 1999.
[69] L. Yen, A. Mantrach, M. Shimbo, M. Saerens, A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances, in: Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 785-793.
[70] Zhang, M.Q.; Marr, T.G., Alignment of molecular sequences seen as random path analysis, Journal of theoretical biology, 174, 2, 119-129, (1995)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.