Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis. (English) Zbl 1377.92060

Summary: We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment.


92D10 Genetics and epigenetics
60J85 Applications of branching processes
60K35 Interacting random processes; statistical mechanics type models; percolation theory
Full Text: DOI arXiv Euclid


[1] Andoni, A., Braverman, M. and Hassidim, A. (2011). Phylogenetic reconstruction with insertions and deletions.
[2] Andoni, A., Daskalakis, C., Hassidim, A. and Roch, S. (2010). Global alignment of molecular sequences via ancestral state reconstruction. In ICS 2010 358-369. Tsinghua University Press, Beijing, China. · Zbl 1250.92034
[3] Athreya, K. B. and Ney, P. E. (1972). Branching Processes. Die Grundlehren der Mathematischen Wissenschaften 196 . Springer, New York. · Zbl 0259.60002
[4] Atteson, K. (1999). The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25 251-278. · Zbl 0938.68747 · doi:10.1007/PL00008277
[5] Csurös, M. (2002). Fast recovery of evolutionary trees with thousands of nodes. J. Comput. Biol. 9 277-297.
[6] Csurös, M. and Kao, M.-Y. (2001). Provably fast and accurate recovery of evolutionary trees through harmonic greedy triplets. SIAM J. Comput. 31 306-322 (electronic). · Zbl 0987.05042 · doi:10.1137/S009753970037905X
[7] Daskalakis, C., Hill, C., Jaffe, A., Mihaescu, R., Mossel, E. and Rao, S. (2006). Maximal accurate forests from distance matrices. In RECOMB 2006 281-295. Springer, Berlin. · Zbl 1215.92048 · doi:10.1007/11732990_24
[8] Daskalakis, C., Mossel, E. and Roch, S. (2006). Optimal phylogenetic reconstruction. In STOC’ 06: Proceedings of the 38 th Annual ACM Symposium on Theory of Computing 159-168. ACM, New York. · Zbl 1301.92054
[9] Daskalakis, C., Mossel, E. and Roch, S. (2009). Phylogenies without branch bounds: Contracting the short, pruning the deep. In RECOMB 2009 451-465. Springer, Berlin. · Zbl 1227.92042
[10] Daskalakis, C. and Roch, S. (2010). Alignment-free phylogenetic reconstruction. In RECOMB 2010 123-137. Springer, Berlin.
[11] Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 1792-1797.
[12] Elias, I. (2006). Settling the intractability of multiple alignment. J. Comput. Biol. 13 1323-1339 (electronic). · doi:10.1089/cmb.2006.13.1323
[13] Erdős, P. L., Steel, M. A., Székely, L. A. and Warnow, T. J. (1999). A few logs suffice to build (almost) all trees. I. Random Structures Algorithms 14 153-184. · Zbl 0945.60004 · doi:10.1002/(SICI)1098-2418(199903)14:2<153::AID-RSA3>3.0.CO;2-R
[14] Erdős, P. L., Steel, M. A., Székely, L. A. and Warnow, T. J. (1999). A few logs suffice to build (almost) all trees. II. Theoret. Comput. Sci. 221 77-118. · Zbl 0933.68100 · doi:10.1016/S0304-3975(99)00028-6
[15] Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol. 27 401-410.
[16] Felsenstein, J. (2004). Inferring Phylogenies . Sinauer, New York.
[17] Graur, D. and Li, W. H. (1999). Fundamentals of Molecular Evolution , 2nd ed. Sinauer, Sunderland, MA.
[18] Gronau, I., Moran, S. and Snir, S. (2008). Fast and reliable reconstruction of phylogenetic trees with very short edges (extended abstract). In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms 379-388. ACM, New York. · Zbl 1192.05030
[19] Higgins, D. G. and Sharp, P. M. (1988). Clustal: A package for performing multiple sequence alignment on a microcomputer. Gene 73 237-244.
[20] Huson, D. H., Nettles, S. H. and Warnow, T. J. (1999). Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol. 6 3-4. · Zbl 1066.92503
[21] Karlin, S. and Taylor, H. M. (1981). A Second Course in Stochastic Processes . Academic Press, New York. · Zbl 0469.60001
[22] Katoh, K., Misawa, K., Kuma, K.-i. and Miyata, T. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 3059-3066.
[23] King, V., Zhang, L. and Zhou, Y. (2003). On the complexity of distance-based evolutionary tree reconstruction. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms ( Baltimore , MD , 2003) 444-453. ACM, New York. · Zbl 1094.68614
[24] Lacey, M. R. and Chang, J. T. (2006). A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences. Math. Biosci. 199 188-215. · Zbl 1086.92039 · doi:10.1016/j.mbs.2005.11.003
[25] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R. and Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324 1561-1564.
[26] Löytynoja, A. and Goldman, N. (2008). Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320 1632-1635.
[27] Metzler, D. (2003). Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19 490-499.
[28] Miklos, I., Lunter, G. A. and Holmes, I. (2004). A “Long Indel” model for evolutionary sequence alignment. Mol. Biol. Evol. 21 529-540.
[29] Mossel, E. (2007). Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Bio. Bioinform. 4 108-116.
[30] Mossel, E. and Roch, S. (2006). Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16 583-614. · Zbl 1137.60034 · doi:10.1214/105051606000000024
[31] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms . Cambridge Univ. Press, Cambridge. · Zbl 0849.68039
[32] Rivas, E. and Eddy, S. R. (2008). Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput. Biol. 4 e1000172, 20. · doi:10.1371/journal.pcbi.1000172
[33] Roch, S. (2008). Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In FOCS 2008 729-738. IEEE Comput. Soc., Los Alamitos, CA.
[34] Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications 24 . Oxford Univ. Press, Oxford. · Zbl 1043.92026
[35] Steel, M. A. and Székely, L. A. (1999). Inverting random functions. Ann. Comb. 3 103-113. · Zbl 0966.92018 · doi:10.1007/BF01609880
[36] Steel, M. A. and Székely, L. A. (2002). Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discrete Math. 15 562-575 (electronic). · Zbl 1055.62522 · doi:10.1137/S089548010138790X
[37] Suchard, M. A. and Redelings, B. D. (2006). BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22 2047-2048.
[38] Thatte, B. D. (2006). Invertibility of the TKF model of sequence evolution. Math. Biosci. 200 58-75. · Zbl 1086.92042 · doi:10.1016/j.mbs.2005.12.025
[39] Thorne, J. L., Kishino, H. and Felsenstein, J. (1991). An evolutionary model for maximum likelihood alignment of dna sequences. Journal of Molecular Evolution 33 114-124.
[40] Thorne, J. L., Kishino, H. and Felsenstein, J. (1992). Inching toward reality: An improved likelihood model of sequence evolution. Journal of Molecular Evolution 34 3-16.
[41] Wang, L. and Jiang, T. (1994). On the complexity of multiple sequence alignment. J. Comput. Biol. 1 337-348.
[42] Wong, K. M., Suchard, M. A. and Huelsenbeck, J. P. (2008). Alignment uncertainty and genomic analysis. Science 319 473-476. · Zbl 1226.92028 · doi:10.1126/science.1151532
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.