zbMATH — the first resource for mathematics

Principal components analysis in the space of phylogenetic trees. (English) Zbl 1231.62110
Summary: Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify the main sources of variation in the original collection of trees in terms of both topology and branch lengths. The approach is illustrated by application to simulated sets of trees and to a set of gene trees from metazoan (animal) species.

62H25 Factor analysis and principal components; correspondence analysis
92D15 Problems related to evolution
65Y20 Complexity and performance of numerical algorithms
62P10 Applications of statistics to biology and medical sciences; meta analysis
65C60 Computational problems in statistics (MSC2010)
Full Text: DOI arXiv
[1] Allen, B. L. and Steel, M. (2001). Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 5 1-15. · Zbl 0978.05023
[2] Amenta, N., Godwin, M., Postarnakevich, N. and St. John, K. (2007). Approximating geodesic tree distance. Inform. Process. Lett. 103 61-65. · Zbl 1184.68658
[3] Aydin, B., Pataki, G., Wang, H., Bullitt, E. and Marron, J. S. (2009). A principal component analysis for trees. Ann. Appl. Stat. 3 1597-1615. · Zbl 1184.62100
[4] Barthélémy, J.-P. (1986). The median procedure for n -trees. J. Classification 3 329-334. · Zbl 0617.62066
[5] Billera, L. J., Holmes, S. P. and Vogtmann, K. (2001). Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27 733-767. · Zbl 0995.92035
[6] Brinkmann, H., van der Geizen, M., Zhou, Y., Poncelin de Raucourt, G. and Philippe, H. (2005). An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. 54 743-757.
[7] Bryant, D. (2003). A classification of consensus methods for phylogenetics. In Bioconsensus ( Piscataway , NJ , 2000/2001). DIMACS Series in Discrete Mathematics and Theoretical Computer Science 61 163-183. Amer. Math. Soc., Providence, RI. · Zbl 1029.05032
[8] Chakerian, J. and Holmes, S. (2010). Computational tools for evaluating phylogenetic and hierachical clustering trees. Available at .
[9] Degnan, J. H. and Salter, L. A. (2005). Gene tree distributions under the coalescent process. Evolution 59 24-37.
[10] Donnelly, P. and Tavaré, S. (1995). Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29 401-421.
[11] Doolittle, W. F. (1999). Lateral genomics. Trends Genet. 15 M5-M8.
[12] Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39 783-791.
[13] Felsenstein, J. (2004). Inferring Phylogenies . Sinauer, Sunderland, MA.
[14] Fletcher, P. T., Lu, C., Pizer, S. M. and Joshi, S. (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Medical Imaging 23 995-1005.
[15] Goodman, M., Porter, C. A., Czelusniak, J., Page, S. L., Schneider, H., Shoshani, J., Gunnell, G. and Groves, C. P. (1998). Toward a phylogenetic classification of primates based on DNA evidence complemented by fossil evidence. Mol. Phyl. Evol. 9 585-598.
[16] Gromov, M. (1987). Hyperbolic groups. In Essays in Group Theory. Mathematical Sciences Research Institute Publications 8 75-263. Springer, New York. · Zbl 0634.20015
[17] Guindon, S. and Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52 696-704.
[18] Hastie, T. and Stuetzle, W. (1989). Principal curves. J. Amer. Statist. Assoc. 84 502-516. · Zbl 0679.62048
[19] Hillis, D. M., Heath, T. A. and St. John, K. (2005). Analysis and visualization of tree space. Syst. Biol. 54 471-482.
[20] Holmes, S. (2005). Statistical approach to tests involving phylogenies. In Mathematics of Evolution and Phylogeny (O. Gascuel, ed.) 91-120. Oxford Univ. Press, Oxford. · Zbl 1090.62123
[21] Huelsenbeck, J. and Hillis, D. (1993). Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42 247-264.
[22] Kupczok, A., Von Haeseler, A. and Klaere, S. (2008). An exact algorithm for the geodesic distance between phylogenetic trees. J. Comput. Biol. 15 577-591.
[23] Nye, T. M. W. (2008). Trees of trees: An approach to comparing multiple alternative phylogenies. Syst. Biol. 57 785-794.
[24] Nye, T. M. W. (2011). Supplement to “Principal components analysis in the space of phylogenetic trees.” . · Zbl 1231.62110
[25] Owen, M. and Provan, J. S. (2010). A fast algorithm for computing geodesic distances in tree space. IEEE/ACM Trans. Comp. Biol. and Bioinf. 8 2-13.
[26] Rambaut, A. and Grassly, N. C. (1997). Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13 235-238.
[27] Robinson, D. F. and Foulds, L. R. (1981). Comparison of phylogenetic trees. Math. Biosci. 53 131-147. · Zbl 0451.92006
[28] Stockham, C., Wang, L.-S. and Warnow, T. (2002). Statistically based postprocessing of phylogenetic analysis by clustering. Bioinformatics 18 S285-S293.
[29] Wang, H. and Marron, J. S. (2007). Object oriented data analysis: Sets of trees. Ann. Statist. 35 1849-1873. · Zbl 1126.62002
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.