×

Spectral neighbor joining for reconstruction of latent tree models. (English) Zbl 1468.62301

Summary: A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop spectral neighbor joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.

MSC:

62H22 Probabilistic graphical models
62M05 Markov processes: estimation; hidden Markov models
62M15 Inference from stochastic processes and spectral analysis
15A18 Eigenvalues, singular values, and eigenvectors

Software:

DendroPy; RAxML
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] E. S. Allman, L. S. Kubatko, and J. A. Rhodes, Split scores: A tool to quantify phylogenetic signal in genome-scale data, Systematic Biol., 66 (2017), pp. 620-636.
[2] E. S. Allman and J. A. Rhodes, Molecular phylogenetics from an algebraic viewpoint, Statist. Sinica, (2007), pp. 1299-1316. · Zbl 1132.92019
[3] A. Anandkumar, K. Chaudhuri, D. J. Hsu, S. M. Kakade, L. Song, and T. Zhang, Spectral methods for learning multivariate latent tree structure, in Advances in Neural Information Processing Systems, 2011, pp. 2025-2033.
[4] S. Aris-Brosou and L. Excoffier, The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism, Molecular Biol. Evol., 13 (1996), pp. 494-504.
[5] K. Atteson, The performance of neighbor-joining methods of phylogenetic reconstruction, Algorithmica, 25 (1999), pp. 251-278. · Zbl 0938.68747
[6] D. Bryant, On the uniqueness of the selection criterion in neighbor-joining, J. Classification, 22 (2005), pp. 3-15. · Zbl 1083.62108
[7] J. H. Camin and R. R. Sokal, A method for deducing branching sequences in phylogeny, Evolution, 19 (1965), pp. 311-326.
[8] J. A. Cavender and J. Felsenstein, Invariants of phylogenies in a simple case with discrete states, J. Classification, 4 (1987), pp. 57-71. · Zbl 0612.62142
[9] J. T. Chang, Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency, Math. Biosci., 137 (1996), pp. 51-73. · Zbl 1059.92504
[10] J. T. Chang and J. A. Hartigan, Reconstruction of evolutionary trees from pairwise distributions on current species, in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, Interface Foundation, Fairfax Station, VA, 1991, pp. 254-257.
[11] M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky, Learning latent tree graphical models, J. Mach. Learn. Res., 12 (2011), pp. 1771-1812. · Zbl 1280.68160
[12] W. H. Day and D. Sankoff, Computational complexity of inferring phylogenies by compatibility, Systematic Biol., 35 (1986), pp. 224-229.
[13] F. Delsuc, H. Brinkmann, and H. Philippe, Phylogenomics and the reconstruction of the tree of life, Nature Rev. Genetics, 6 (2005), pp. 361-375.
[14] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. · Zbl 0929.92010
[15] P. L. Erdös, M. A. Steel, L. A. Székely, and T. J. Warnow, A few logs suffice to build (almost) all trees (I), Random Structures Algorithms, 14 (1999), pp. 153-184. · Zbl 0945.60004
[16] N. Eriksson, Tree construction using singular value decomposition, in Algebraic Statistics for Computational Biology, L. Pachter and B. Sturmfels, eds., Cambridge University Press, 2005, pp. 347-358. · Zbl 1374.60140
[17] G. F. Estabrook, F. McMorris, and C. A. Meacham, Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units, Systematic Zoology, 34 (1985), pp. 193-200.
[18] J. Felsenstein, Evolutionary trees from DNA sequences: a maximum likelihood approach, J. Molecular Evol., 17 (1981), pp. 368-376.
[19] J. Felsenstein, Inferring Phylogenies, 2nd ed., Sinauer Associates, Sunderland, MA, 2004.
[20] J. Fernández-Sánchez and M. Casanellas, Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages, Systematic Biol., 65 (2016), pp. 280-291.
[21] W. M. Fitch, Toward defining the course of evolution: Minimum change for a specific tree topology, Systematic Biol., 20 (1971), pp. 406-416.
[22] O. Gascuel and M. Steel, Neighbor-joining revealed, Molecular Biol. Evol., 23 (2006), pp. 1997-2000.
[23] O. Gascuel and M. Steel, A “stochastic safety radius” for distance-based tree reconstruction, Algorithmica, 74 (2016), pp. 1386-1403. · Zbl 1350.92035
[24] X. Gu, Y. X. Fu, and W. H. Li, Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites, Molecular Biol. Evol., 12 (1995), pp. 546-557.
[25] S. Guindon and O. Gascuel, A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood, Systematic Biol., 52 (2003), pp. 696-704.
[26] M. Hajdinjak, Q. Fu, A. Hübner, M. Petr, F. Mafessoni, S. Grote, P. Skoglund, V. Narasimham, H. Rougier, I. Crevecoeur et al., Reconstructing the genetic history of late Neanderthals, Nature, 555 (2018), pp. 652-656.
[27] S. Harmeling and C. K. Williams, Greedy learning of binary latent trees, IEEE Trans. Pattern Anal. Mach. Intell., 33 (2010), pp. 1087-1097.
[28] F. Huang, N. Un, J. Perros, R. Chen, J. Sun, and A. Anandkumar, Scalable latent tree model and its application to health analytics, in Machine Learning in Healthcare, NIPS Workshop, 2015.
[29] A. Jaffe, E. Fetaya, B. Nadler, T. Jiang, and Y. Kluger, Unsupervised ensemble learning with dependent classifiers, Proc. Mach. Learn. Res., 51 (2016), pp. 351-360.
[30] A. Jaffe, B. Nadler, and Y. Kluger, Estimating the accuracies of multiple classifiers without labeled data, Proc. Mach. Learn. Res., 38 (2015), pp. 407-415.
[31] A. Jaffe, R. Weiss, S. Carmi, Y. Kluger, and B. Nadler, Learning binary latent variable models: A tensor eigenpair approach, in Proceedings of the 35th International Conference on Machine Learning, Curran Associates, 2018, pp. 2196-2205.
[32] F. Jia, N. Lo, and S. Y. W. Ho, The impact of modelling rate heterogeneity among sites on phylogenetic estimates of intraspecific evolutionary rates and timescales, PLoS One, 9 (2014), pp. 1-8.
[33] T. Jiang, P. Kearney, and M. Li, A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application, SIAM J. Comput., 30 (2001), pp. 1942-1961, https://doi.org/10.1137/S0097539799361683. · Zbl 0980.68055
[34] K. S. John, T. Warnow, B. M. Moret, and L. Vawter, Performance study of phylogenetic methods: (Unweighted) quartet methods and neighbor-joining, J. Algorithms, 48 (2003), pp. 173-193. · Zbl 1079.68646
[35] T. H. Jukes and C. R. Cantor, Evolution of protein molecules, Mammalian Protein Metabolism, 3 (1969), pp. 21-132.
[36] M. R. Lacey and J. T. Chang, A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences, Math. Biosci., 199 (2006), pp. 188-215. · Zbl 1086.92039
[37] J. A. Lake, Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances, Proc. Natl. Acad. Sci. USA, 91 (1994), pp. 1455-1459.
[38] R. S. Lanciotti, A. J. Lambert, M. Holodniy, S. Saavedra, and L. del Carmen Castillo Signor, Phylogeny of Zika virus in Western Hemisphere, 2015, Emerging Infectious Diseases, 22 (2016), pp. 933-935.
[39] R. Mihaescu, D. Levy, and L. Pachter, Why neighbor-joining works, Algorithmica, 54 (2009), pp. 1-24. · Zbl 1187.68683
[40] E. Mossel and S. Roch, Learning nonsingular phylogenies and hidden Markov models, in Proceedings of the 37th Annual ACM Symposium on Theory of Computing, 2005, pp. 366-375. · Zbl 1192.68394
[41] R. Mourad, C. Sinoquet, N. L. Zhang, T. Liu, and P. Leray, A survey on latent tree models and applications, J. Artificial Intelligence Res., 47 (2013), pp. 157-203. · Zbl 1270.68097
[42] M. Nei and S. Kumar, Molecular Evolution and Phylogenetics, Oxford University Press, 2000.
[43] F. Parisi, F. Strino, B. Nadler, and Y. Kluger, Ranking and combining multiple predictors without labeled data, Proc. Natl. Acad. Sci. USA, 111 (2014), pp. 1253-1258. · Zbl 1359.62259
[44] Y. Pauplin, Direct calculation of a tree length using a distance matrix, J. Molecular Evol., 51 (2000), pp. 41-47.
[45] J. Pearl and M. Tarsi, Structuring causal trees, J. Complexity, 2 (1986), pp. 60-77. · Zbl 0589.68060
[46] B. Rannala and Z. Yang, Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference, J. Molecular Evol., 43 (1996), pp. 304-311.
[47] V. Ranwez and O. Gascuel, Quartet-based phylogenetic inference: Improvements and limits, Molecular Biol. Evol., 18 (2001), pp. 1103-1116.
[48] J. A. Rhodes, Topological metrizations of trees, and new quartet methods of tree inference, IEEE/ACM Trans. Comput. Biol. Bioinform., 17 (2020), pp. 2107-2118.
[49] S. Roch, A short proof that phylogenetic tree reconstruction by maximum likelihood is hard, IEEE/ACM Trans. Comput. Biol. Bioinform., 3 (2006), pp. 92-94.
[50] J. P. Rusinko and B. Hipp, Invariant based quartet puzzling, Algorithms Molecular Biol., 7 (2012), 35.
[51] N. Saitou and M. Nei, The neighbor-joining method: A new method for reconstructing phylogenetic trees, Molecular Biol. Evol., 4 (1987), pp. 406-425.
[52] C. Semple and M. Steel, Phylogenetics, Oxford Lecture Ser. Math. Appl. 24, Oxford University Press, 2003. · Zbl 1043.92026
[53] A. B. Smith, Rooting molecular trees: Problems and strategies, Biol. J. Linnean Soc., 51 (1994), pp. 279-292.
[54] S. Snir, T. Warnow, and S. Rao, Short quartet puzzling: A new quartet-based phylogeny reconstruction algorithm, J. Comput. Biol., 15 (2008), pp. 91-103.
[55] R. R. Sokal, A statistical method for evaluating systematic relationships, Univ. Kansas Sci. Bull., 38 (1958), pp. 1409-1438.
[56] A. Stamatakis, RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models, Bioinformatics, 22 (2006), pp. 2688-2690.
[57] M. Steel, Phylogeny: Discrete and Random Processes in Evolution, SIAM, 2016, https://doi.org/10.1137/1.9781611974485. · Zbl 1361.92001
[58] K. Strimmer and A. von Haeseler, Accuracy of neighbor joining for n-taxon trees, Systematic Biol., 45 (1996), pp. 516-523.
[59] K. Strimmer and A. Von Haeseler, Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies, Molecular Biol. Evol., 13 (1996), pp. 964-969.
[60] J. Sukumaran and M. T. Holder, Dendropy: A Python library for phylogenetic computing, Bioinformatics, 26 (2010), pp. 1569-1571.
[61] E. Susko, Y. Inagaki, and A. J. Roger, On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled, Molecular Biol. Evol., 21 (2004), pp. 1629-1642.
[62] K. Tamura, M. Nei, and S. Kumar, Prospects for inferring very large phylogenies by using the neighbor-joining method, Proc. Natl. Acad. Sci. USA, 101 (2004), pp. 11030-11035.
[63] P. J. Waddell and M. Steel, General time-reversible distances with unequal rates across sites: Mixing \(\gamma\) and inverse Gaussian distributions with invariant sites, Molecular Phylogenetics Evol., 8 (1997), pp. 398-414.
[64] J. Wakeley, Coalescent theory: An introduction, W. H. Freeman, 2009. · Zbl 1366.92001
[65] M. Wilkinson, J. O. McInerney, R. P. Hirt, P. G. Foster, and T. M. Embley, Of clades and clans: Terms for phylogenetic relationships in unrooted trees, Trends Ecol. Evol., 22 (2007), pp. 114-115.
[66] Z. Yang and B. Rannala, Molecular phylogenetics: Principles and practice, Nature Rev. Genetics, 13 (2012), pp. 303-314.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.