Purdom, Elizabeth
Analysis of a data matrix and a graph: metagenomic data and the phylogenetic tree. (English) Zbl 1234.62148
Ann. Appl. Stat. 5, No. 4, 2326-2358 (2011).

Summary: In biological experiments, researchers often have information in the form of a graph that supplements observed numerical data. Incorporating the knowledge contained in these graphs into an analysis of the numerical data is an important and nontrivial task. We look at the example of metagenomic data: data from a genomic survey of the abundance of different species of bacteria in a sample. Here, the graph of interest is a phylogenetic tree depicting the interspecies relationships among the bacteria species. We illustrate that analysis of the data in a nonstandard inner-product space effectively uses this additional graphical information and produces more meaningful results.

Cited in 7 Documents

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H25 Factor analysis and principal components; correspondence analysis
92D15 Problems related to evolution
05C99 Graph theory
65C60 Computational problems in statistics (MSC2010)

Keywords: multivariate analysis; principal components
Software: ade4; sedaR; R
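
The summary's central idea, analyzing a data matrix in a nonstandard inner-product space, can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a generic principal components analysis under the inner product <x, y>_Q = x' Q y, where Q is a symmetric positive definite matrix that encodes relationships among the columns (species). The function name `pca_in_metric`, the toy counts `X`, and the matrix `Q` (built from shared branch lengths on a hypothetical four-species tree) are all assumptions made for illustration. How Q should actually be derived from a phylogenetic tree is not stated in the summary.

```python
import numpy as np


def pca_in_metric(X, Q):
    """PCA of the rows of X under the inner product <x, y>_Q = x @ Q @ y.

    Q is assumed symmetric positive definite (hypothetical choice of metric).
    Returns (scores, axes, singular_values); the columns of `axes` are
    orthonormal with respect to Q.
    """
    Xc = X - X.mean(axis=0)                      # center each column
    evals, evecs = np.linalg.eigh(Q)             # Q = E diag(evals) E'
    Q_half = (evecs * np.sqrt(evals)) @ evecs.T  # symmetric square root of Q
    Q_neg_half = (evecs / np.sqrt(evals)) @ evecs.T
    # Ordinary SVD after mapping the data into Euclidean coordinates.
    U, s, Vt = np.linalg.svd(Xc @ Q_half, full_matrices=False)
    axes = Q_neg_half @ Vt.T                     # map the axes back
    scores = U * s                               # sample coordinates
    return scores, axes, s


# Toy data: 6 samples by 4 species (simulated counts, not real data).
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(6, 4)).astype(float)

# Hypothetical metric: species 1-2 and species 3-4 are sister pairs that
# share one unit of branch length; each tip has total depth two.
Q = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

scores, axes, s = pca_in_metric(X, Q)
```

With Q set to the identity matrix this reduces to ordinary PCA, so the choice of Q is the only place where the graph enters the analysis in this sketch.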