Partially local multi-way alignments. (English) Zbl 1403.68377

Summary: Multiple sequence alignments are an essential tool in bioinformatics and computational biology, where they are used to represent the mutual evolutionary relationships and similarities between a set of DNA, RNA, or protein sequences. More recently they have also received considerable interest in other application domains, in particular in comparative linguistics. Multiple sequence alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the increase in the power of computational equipment, exact, dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Here we introduce a general formal framework that gives raise to a classification of partially local alignment problems. This leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.


68W32 Algorithms on strings
68Q42 Grammars and rewriting systems
Full Text: DOI


[1] Abboud, A., Backurs, A., Williams, V.V.: Tight hardness results for LCS and other sequence similarity measures. In: 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 59-78. IEEE (2015)
[2] Abboud, A., Hansen, T.D., Williams, V.V., Williams, R.: Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In: Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC ’16), New York, NY, pp. 375-388 (2016) · Zbl 1373.68233
[3] Al Ait, L; Yamak, Z; Morgenstern, B; Morgenstern, B, DIALIGN at GOBICS—multiple sequence alignment using various sources of external information, Nucleic Acids Res., 41, w3-w7, (2013)
[4] Al Arab, M; Bernt, M; Siederdissen, CH; Tout, K, Partially local three-way alignments and the sequence signatures of mitochondrial genome rearrangements, Alg. Mol. Biol., 12, 22, (2017)
[5] Altschul, SF, Gap costs for multiple sequence alignment, J. Theor. Biol., 138, 297-309, (1989)
[6] Altschul, SF; Gish, W; Miller, W; Myers, EW; Lipman, DJ, Basic local alignment search tool, J. Mol. Biol., 215, 403-410, (1990)
[7] Angermüller, C; Biegert, A; Söding, J, Discriminative modelling of context-specific amino acid substitution probabilities, Bioinformatics, 28, 3240-3247, (2012)
[8] Arlazarov, V; Dinic, E; Kronrod, M; Faradzev, I, On economical construction of the transitive closure of a directed graph, Dokl. Akad. Nauk., 11, 1209-1210, (1970) · Zbl 0214.23601
[9] Baichoo, S; Ouzounis, CA, Computational complexity of algorithms for sequence comparison, short-Read assembly and genome alignment, Biosystems, 156, 72-85, (2017)
[10] Bailey, TL; Williams, N; Misleh, C; Wilfred, WL, MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Res., 34, w369-w373, (2006)
[11] Bhattacharya, T., Retzlaff, N., Blasi, D., Croft, W., Cysouw, M., Hruschka, D., Maddieson, I., Müller, L., Smith, E., Stadler, P.F., Starostin, G., Youn, H.: Studying language evolution in the age of big data. J. Lang. Evol. (2018)
[12] Blanchette, M, Computation and analysis of genomic multi-sequence alignments, Annu. Rev. Genomics Hum. Genet., 8, 193-213, (2007)
[13] Blanchette, M; Tompa, M, Footprinter: a program designed for phylogenetic footprinting, Nucleic Acids Res., 31, 3840-3842, (2003)
[14] Blanchette, M; Schwikowski, B; Tompa, M, Algorithms for phylogenetic footprinting, J. Comput. Biol., 9, 211-223, (2002)
[15] Bonizzoni, P; Della Vedova, G, The complexity of multiple sequence alignment with SP-score that is a metric, Theor. Comput. Sci., 259, 63-79, (2001) · Zbl 0972.68092
[16] Brudno, M; Chapman, M; Göttgens, B; Batzoglou, S; Morgenstern, B, Fast and sensitive multiple alignment of large genomic sequences, BMC Bioinform., 4, 66, (2003)
[17] Bucher, P; Hoffmann, K; States, DJ (ed.); Agarwal, P (ed.); Gaasterland, T (ed.); Hunter, L (ed.); Smith, RF (ed.), A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system, 44-50, (1996), Menlo Park, CA
[18] Bussotti, G; Raineri, E; Erb, I; Zytnicki, M; Wilm, A; Beaudoing, E; Bucher, P; Notredame, C, Blastr-fast and accurate database searches for non-coding rnas, Nucleic Acids Res., 39, 6886-6895, (2011)
[19] Carrillo, H; Lipman, D, The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48, 1073-1082, (1988) · Zbl 0652.92001
[20] Chowdhurya, B; Garaib, G, A review on multiple sequence alignment from the perspective of genetic algorithm, Genomics, 109, 419-431, (2017)
[21] Collingridge, PW; Kelly, S, Mergealign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments, BMC Bioinform., 13, 117, (2012)
[22] Corel, E; Pitschi, F; Morgenstern, B, A MIN-cut algorithm for the consistency problem in multiple sequence alignment, Bioinformatics, 26, 1015-1021, (2010)
[23] Cysouw, M., Jung, H.: Cognate identification and alignment using practical orthographies. In: Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp. 109-116. Association for Computational Linguistics (2007)
[24] Dewey, TG, A sequence alignment algorithm with an arbitrary gap penalty function, J. Comput. Biol., 8, 177-190, (2001)
[25] Dowell, RD; Eddy, SR, Evaluation of several lightweight stochastic context free grammars for RNA secondary structure prediction, BMC Bioinform., 5, 71, (2004)
[26] Eddy, SR, A probabilistic model of local sequence alignment that simplifies statistical significance estimation, PLoS Comput. Biol., 4, e1000069, (2008)
[27] Elias, I, Settling the intractability of multiple alignment, J. Comput. Biol., 13, 1323-1339, (2006)
[28] Fonseca, NA; Rung, J; Brazma, A; Marioni, JC, Tools for mapping high-throughput sequencing data, Bioinformatics, 28, 3169-3177, (2012)
[29] Giegerich, R; Giancarlo, R (ed.); Sankoff, D (ed.), Explaining and controlling ambiguity in dynamic programming, No. 1848, 46-59, (2002), Berlin
[30] Giegerich, R; Meyer, C; Steffen, P, A discipline of dynamic programming over sequence data, Sci. Comput. Prog., 51, 215-263, (2004) · Zbl 1067.90157
[31] Gotoh, O, An improved algorithm for matching biological sequences, J. Mol. Biol., 162, 705708, (1982)
[32] Gotoh, O, Alignment of three biological sequences with an efficient traceback procedure, J. Theor. Biol., 121, 327-337, (1986)
[33] Gupta, SK; Kececioglu, JD; Schäffer, AA, Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment, J. Comput. Biol., 2, 459-472, (1995)
[34] Hertel, J; Jong, D; Marz, M; Rose, D; Tafer, H; Tanzer, A; Schierwater, B; Stadler, PF, Non-coding RNA annotation of the genome of trichoplax adhaerens, Nucleic Acids Res., 37, 1602-1615, (2009)
[35] Hirosawa, M; Totoki, Y; Hoshida, M; Ishikawa, M, Comprehensive study on iterative algorithms of multiple sequence alignment, Comput. Appl. Biosci., 11, 13-18, (1995)
[36] Hoffmann, S; Otto, C; Doose, G; Tanzer, A; Langenberger, D; Christ, S; Kunz, M; Holdt, LM; Teupser, D; Hackermüller, J; Stadler, PF, A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection, Genome Biol., 15, r34, (2014)
[37] Hogeweg, P; Hesper, B, The alignment of sets of sequences and the construction of phyletic trees: an integrated method, J. Mol. Evol., 20, 175-186, (1984)
[38] James Kent, W, BLAT—the BLAST-like alignment tool, Genome Res., 12, 656-664, (2002)
[39] Jones, N.C., Pevzner, P.A.: An Introduction to Bioinformatics. MIT Press, Cambride (2004). Problem 6.22
[40] Just, W, Computational complexity of multiple sequence alignment with SP-score, J. Comput. Biol., 8, 615-623, (2001)
[41] Katoh, K; Standley, DM, MAFFT: iterative refinement and additional methods, Methods Mol. Biol., 1079, 131-146, (2014)
[42] Kececioglu, J.D.: The maximum weight trace problem in multiple sequence alignment. In: Proceedings of the 4th Symposium on Combinatorial Pattern Matching, volume 684 of Lecture Notes Computer Science, pp. 106-119. Springer, Berlin (1993)
[43] Kececioglu, J., Starrett, D.: Aligning alignments exactly. In: Bourne, P.E., Gusfield, D. (eds.) Proceedings of the 8th ACM Conference on Research in Computational Molecular Biology (RECOMB), pp. 85-96. ACM, New York, NY (2004)
[44] Kececioglu, JD; Lenhof, H-P; Mehlhorn, K; Mutzel, P; Reinert, K; Vingron, M, A polyhedral approach to sequence alignment problems, Discrete Appl. Math., 104, 143-186, (2000) · Zbl 0998.92017
[45] Konagurthu, AS; Whisstock, J; Stuckey, PJ, Progressive multiple alignment using sequence triplet optimization and three-residue exchange costs, J. Bioinform. Comput. Biol., 2, 719-745, (2004)
[46] Kondrak, G.: A new algorithm for the alignment of phonetic sequences. In: Proceedings of NAACL 2000 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 288-295. Morgan Kaufmann Publishers, San Francisco, CA (2000)
[47] Kondrak, G, Phonetic alignment and similarity, Comput. Humanit., 37, 273-291, (2003)
[48] Kruspe, M; Stadler, PF, Progressive multiple sequence alignments from triplets, BMC Bioinform., 8, 254, (2007)
[49] List, J-M; Greenhill, SJ; Gray, RD, The potential of automatic word comparison for historical linguistics, PLoS ONE, 12, e0170046, (2017)
[50] Lukashin, AV; Rosa, JJ, Local multiple sequence alignment using dead-end elimination, Bioinformatics, 15, 947-953, (1999)
[51] Manthey, B, Non-approximability of weighted multiple sequence alignment, Theor. Comput. Sci., 296, 179-192, (2003) · Zbl 1044.68160
[52] Margulies, EH; Blanchette, M; Haussler, D; Green, ED, Identification and characterization of multi-species conserved sequences, Genome Res., 13, 2507-2518, (2003)
[53] Meier, A; Söding, J, Context similarity scoring improves protein sequence alignments in the midnight zone, Bioinformatics, 31, 674-681, (2015)
[54] Miyazawa, S, A reliable sequence alignment method based on probabilities of residue correspondences, Protein Eng., 8, 999-1009, (1994)
[55] Morgenstern, B; Frech, K; Dress, A; Werner, T, DIALIGN: finding local similarities by multiple sequence alignment, Bioinformatics, 14, 290-294, (1998)
[56] Morgenstern, B., Stoye, J., Dress, A.W.M.: Consistent equivalence relations: a set-theoretical framework for multiple sequence alignments. Technical report, University of Bielefeld, FSPM (1999)
[57] Mückstein, U; Hofacker, IL; Stadler, PF, Stochastic pairwise alignments, Bioinformatics, 60, s153-s118, (2002)
[58] Needleman, SB; Wunsch, CD, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., 48, 443-453, (1970)
[59] Notredame, C; Higgins, D; Heringa, J, T-coffee: a novel method for multiple sequence alignments, J. Mol. Biol., 302, 205-217, (2000)
[60] Otto, W., Stadler, P.F., Prohaska, S.J.: Phylogenetic footprinting and consistent sets of local alignments. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011, volume 6661 of Lecture Notes in Computer Science, pp. 118-131. Springer, Heidelberg (2011) · Zbl 1339.92059
[61] Ovcharenko, I; Loots, GG; Giardine, BM; Hou, M; Ma, J; Hardison, RC; Stubbs, L; Miller, W, Mulan: multiple-sequence local alignment and visualization for studying function and evolution, Genome Res., 15, 184-194, (2005)
[62] Overington, J; Donnelly, D; Johnson, MS; Šali, A; Blundell, TL, Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds, Protein Sci., 1, 216-226, (1992)
[63] Prohaska, S; Fried, C; Flamm, C; Wagner, G; Stadler, PF, Surveying phylogenetic footprints in large gene clusters: applications to hox cluster duplications, Mol. Phylogen. Evol., 31, 581-604, (2004)
[64] Prüfer, K; Stenzel, U; Hofreiter, M; Pääbo, S; Kelso, J; Green, RE, Computational challenges in the analysis of ancient DNA, Genome Biol., 11, r47, (2010)
[65] Rausch, T; Koren, S; Denisov, G; Weese, D; Emde, A-K; Döring, A; Reinert, K, A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads, Bioinformatics, 25, 1118-1124, (2009)
[66] Retzlaff, N.: A two-step scoring model for computational phylolinguistics. In: de Haan, Ronald (ed.), Proceedings of the ESSLLI 2014 Student Session, pp. 196-206. TU Wien, Vienna, A (2014). www.kr.tuwien.ac.at/drm/dehaan/stus2014/proceedings.pdf. Accessed 21 Feb 2018
[67] Sankoff, D, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J. Appl. Math., 45, 810-825, (1985) · Zbl 0581.92012
[68] Sankoff, D., Kruskal, J.B.: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA (1983) · Zbl 0952.68143
[69] Setubal, J.C., Meidanis, J.: Introduction to Computational Molecular Biology. PWS Pub, Boston, MA (1997)
[70] Shigemizu, D., Maruyama, O.: Searching for regulatory elements of alternative splicing events using phylogenetic footprinting. In: Jonassen, I., Kim, J. (eds.) 4th International Workshop on Algorithms in Bioinformatics, vol. 3240, pp. 147-158. Springer, Heidelberg (2004)
[71] Sievers, F; Higgins, DG, Clustal omega for making accurate alignments of many protein sequences, Protein Sci., 27, 135-145, (2018)
[72] Smith, TF; Waterman, MS, Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197, (1981)
[73] Steiner, L; Stadler, PF; Cysouw, M, A pipeline for computational historical linguistics, Lang. Dyn. Change, 1, 89-127, (2011)
[74] Tabei, Y; Asai, K, A local multiple alignment method for detection of non-coding RNA sequences, Bioinformatics, 25, 1498-1505, (2009)
[75] Thompson, JD; Koehl, P; Ripp, R; Poch, O, Balibase 3.0: latestdevelopments of the multiple sequence alignment benchmark, Proteins, 61, 127-136, (2005)
[76] Thompson, JD; Linard, B; Lecompte, O; Poch, O, A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives, PLoS ONE, 6, e18093, (2011)
[77] Wang, L; Jiang, T, On the complexity of multiple sequence alignment, J. Comput. Biol., 1, 337-348, (1994)
[78] Waterman, MS; Smith, TF; Beyer, WA, Some biological sequence metrics, Adv. Math., 20, 367-387, (1976) · Zbl 0342.92003
[79] Wheeler, TJ; Kececioglu, JD, Multiple alignment by aligning alignments, Bioinformatics, 23, i559-i568, (2007)
[80] Will, S; Missal, K; Hofacker, IL; Stadler, PF; Backofen, R, Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering, PLoS Comput. Biol., 3, e65, (2007)
[81] Will, S; Siebauer, MF; Heyne, S; Engelhardt, J; Stadler, PF; Reiche, K; Backofen, R, Locarnascan: incorporating thermodynamic stability in sequence and structure-based RNA homology search, Alg. Mol. Biol., 8, 14, (2013)
[82] Yamada, S; Gotoh, O; Yamana, H, Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost, BMC Bioinform., 7, 524, (2006)
[83] Yi-Kuo, Y; Hwa, T, Statistical significance of probabilistic sequence alignment and related local hidden Markov models, J. Comput. Biol., 8, 249-282, (2001)
[84] Zhang, Z; Gerstein, M, Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements, J. Biol., 2, 11, (2003)
[85] Siederdissen, CH; Hofacker, IL; Stadler, PF, Product grammars for alignment and folding, IEEE/ACM Trans. Comput. Biol. Bioinform., 12, 507-519, (2015)
[86] Siederdissen, CH; Prohaska, SJ; Stadler, PF, Algebraic dynamic programming over general data structures, BMC Bioinform., 16, 19:s2, (2015)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.