×

Tropical sufficient statistics for persistent homology. (English) Zbl 1443.62018

Summary: We show that an embedding in Euclidean space based on tropical geometry generates stable sufficient statistics for barcodes. In topological data analysis, barcodes are multiscale summaries of algebraic topological characteristics that capture the “shape” of data; however, in practice, they have complex structures that make them difficult to use in statistical settings. The sufficiency result presented in this work allows for classical probability distributions to be assumed on the tropical geometric representation of barcodes. This makes a variety of parametric statistical inference methods accessible to barcodes, all while maintaining their initial interpretations. More specifically, we show that exponential family distributions may be assumed and that likelihood functions for persistent homology may be constructed. We conceptually demonstrate sufficiency and illustrate its utility in persistent homology dimensions 0 and 1 with concrete parametric applications to human immunodeficiency virus and avian influenza data.

MSC:

62B05 Sufficient statistics and fields
62R40 Topological data analysis
55N31 Persistent homology and applications, topological data analysis
14T90 Applications of tropical geometry
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D30 Epidemiology
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] I. Abraham, Y. Bartal, and O. Neiman, Embedding metrics into ultrametrics and graphs into spanning trees with constant average distortion, in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, Philadelphia, pp. 502-511. · Zbl 1302.68271
[2] H. Adams and G. Carlsson, Evasion paths in mobile sensor networks, Int. J. Robotics Res., 34 (2015), pp. 90-104.
[3] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier, Persistence images: A stable vector representation of persistent homology, J. Mach. Learn. Res., 18 (2017), pp. 218-252, http://dl.acm.org/citation.cfm?id=3122009.3122017. · Zbl 1431.68105
[4] A. Adcock, E. Carlsson, and G. Carlsson, The ring of algebraic functions on persistence bar codes, Homology Homotopy Appl., 18 (2016), pp. 381-402. · Zbl 1420.55017
[5] A. Adcock, D. Rubin, and G. Carlsson, Classification of hepatic lesions using the matching metric, Comput. Vision Image Understanding, 121 (2014), pp. 36-42, http://www.sciencedirect.com/science/article/pii/S1077314213002221.
[6] R. Adler and J. Taylor, Topological Complexity of Smooth Random Functions: École d’Été de Probabilités de Saint-Flour XXXIX-\(2009\), Lecture Notes in Mathematics, Springer, Berlin, 2011. · Zbl 1230.60001
[7] R. J. Adler, S. Agami, and P. Pranav, Modeling and replicating statistical topology and evidence for CMB nonhomogeneity, Proc. Natl. Acad. Sci., (2017), http://www.pnas.org/content/early/2017/10/23/1706885114. · Zbl 1407.55004
[8] S. Anders and W. Huber, Differential expression analysis for sequence count data, Genome Biol., 11 (2010), R106, https://doi.org/10.1186/gb-2010-11-10-r106.
[9] M. S. Bartlett, Properties of sufficiency and statistical tests, Proc. Roy. Soc. London Ser. A Math. Phys. Sci., 160 (1937), pp. 268-282. · Zbl 0016.41201
[10] U. Bauer, Ripser, 2015, https://github.com/Ripser/ripser.
[11] P. Bendich, S. P. Chin, J. Clark, J. Desena, J. Harer, E. Munch, A. Newman, D. Porter, D. Rouse, N. Strawn, and A. Watkins, Topological and statistical behavior classifiers for tracking applications, IEEE Trans. Aerospace Electronic Syst., 52 (2016), pp. 2644-2661, https://doi.org/10.1109/TAES.2016.160405.
[12] L. J. Billera, S. P. Holmes, and K. Vogtmann, Geometry of the Space of Phylogenetic Trees, Adv. Appl. Math., 27 (2001), pp. 733-767, http://www.sciencedirect.com/science/article/pii/S0196885801907596. · Zbl 0995.92035
[13] P. Billingsley, Probability and Measure, John Wiley & Sons, New York, 1979. · Zbl 0411.60001
[14] A. J. Blumberg, I. Gal, M. A. Mandell, and M. Pancia, Robust statistics, hypothesis testing, and confidence intervals for persistent homology on metric measure spaces, Found. Comput. Math., 14 (2014), pp. 745-789. · Zbl 1364.55016
[15] P. Bubenik, Statistical topological data analysis using persistence landscapes, J. Mach. Learn. Res., 16 (2015), pp. 77-102. · Zbl 1337.68221
[16] S. Capella-Gutiérrez, J. M. Silla-Martínez, and T. Gabaldón, trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses, Bioinformatics, 25 (2009), pp. 1972-1973, http://dx.doi.org/10.1093/bioinformatics/btp348.
[17] G. Carlsson, Topological pattern recognition for point cloud data, Acta Numerica, 23 (2014), pp. 289-368. · Zbl 1398.68615
[18] G. Carlsson and S. Kališnik Verovšek, Symmetric and \(r\)-symmetric tropical polynomials and rational functions, J. Pure Appl. Algebra, 220 (2016), pp. 3610-3627, http://www.sciencedirect.com/science/article/pii/S0022404916300251. · Zbl 1375.14211
[19] M. Carrière, S. Y. Oudot, and M. Ovsjanikov, Stable Topological Signatures for Points on 3D Shapes, in Proceedings of the Eurographics Symposium on Geometry Processing 2015, 34 (2015), pp. 77-102.
[20] J. M. Chan, G. Carlsson, and R. Rabadán, Topology of viral evolution, in Proceedings of the National Academy of Sciences, 110 (2013), pp. 18566-18571, http://www.pnas.org/content/110/46/18566.abstract. · Zbl 1292.92014
[21] F. Chazal, B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman, Stochastic convergence of persistence landscapes and silhouettes, in Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SOCG’14, New York, 2014, ACM, pp. 474:474-474:483, http://doi.acm.org/10.1145/2582112.2582128. · Zbl 1395.62187
[22] Y.-C. Chen, D. Wang, A. Rinaldo, and L. Wasserman, Statistical analysis of persistence intensity functions, preprint, arXiv, (2015), .
[23] M. K. Chung, P. Bubenik, and P. T. Kim, Persistence diagrams of cortical surface data, in Information Processing in Medical Imaging, J. L. Prince, D. L. Pham, and K. J. Myers, eds., Springer, Berlin, 2009, pp. 386-397.
[24] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, Stability of persistence diagrams, Discrete Comput. Geometry, 37 (2007), pp. 103-120, https://doi.org/10.1007/s00454-006-1276-5. · Zbl 1117.54027
[25] L. Crawford, A. Monod, A. X. Chen, S. Mukherjee, and R. Rabadán, Functional Data Analysis using a Topological Summary Statistic: the Smooth Euler Characteristic Transform, preprint, arXiv, (2016), .
[26] L. Crawford, K. C. Wood, X. Zhou, and S. Mukherjee, Bayesian approximate kernel regression with variable selection, J. Amer. Stat. Assoc., (2018), pp. 1-12, https://doi.org/10.1080/01621459.2017.1361830.
[27] C. Curto, V. Itskov, A. Veliz-Cuba, and N. Youngs, The neural ring: An algebraic tool for analyzing the intrinsic structure of neural codes, Bull. Math. Biol., 75 (2013), pp. 1571-1611. · Zbl 1311.92043
[28] G. Darmois, Sur les lois de probabilités à estimation exhaustive, C.R. Acad. Sci. Paris (in French), 200 (1935), pp. 1265-1266. · JFM 61.1308.05
[29] P. Diaconis, Sufficiency as statistical symmetry, in Proceedings of the AMS Centennial Symposium, 1988, pp. 15-26. · Zbl 0928.62006
[30] P. Donnelly and S. Tavaré, Coalescents and genealogical structure under neutrality, Ann. Rev. Genetics, 29 (1995), pp. 401-421, https://doi.org/10.1146/annurev.ge.29.120195.002153. PMID: 8825481.
[31] J. Duchi, Derivations for Linear Algebra and Optimization, Technical report, Berkeley, California, 2007, https://web.stanford.edu/ jduchi/projects/general_notes.pdf.
[32] H. Edelsbrunner, D. Letscher, and A. J. Zomorodian, Topological persistence and simplification, Discrete Comput. Geometry, 28 (2002), pp. 511-533. · Zbl 1011.68152
[33] K. Emmett, D. Rosenbloom, P. Cámara, and R. Rabadán, Parametric inference using persistence diagrams: a case study in population genetics, preprint, arXiv, (2014), .
[34] B. T. Fasy, F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, and A. Singh, Confidence sets for persistence diagrams, Ann. Stat., 42 (2014), pp. 2301-2339, https://doi.org/10.1214/14-AOS1252. · Zbl 1310.62059
[35] M. Ferri and C. Landi, Representing size functions by complex polynomials, Proc. Math. Met. Pattern Recognition, 9 (1999), pp. 16-19.
[36] M. Ferri and I. Stanganelli, Size Functions for the Morphological Analysis of Melanocytic Lesions, J. Biomedical Imaging, 2010 (2010), pp. 5:1-5:5, https://doi.org/10.1155/2010/621357.
[37] R. A. Fisher, On the mathematical foundations of theoretical statistics, Philosoph. Trans. Roy. Soc. London A Math. Phys. Engrg. Sci., 222 (1922), pp. 309-368. · JFM 48.1280.02
[38] W. M. Fitch and E. Margoliash, Construction of Phylogenetic Trees, Science, New York, 155 (1967), pp. 279-284, http://www.jstor.org/stable/1720651.
[39] P. Frosini and C. Landi, Size theory as a topological tool for computer vision, Pattern Recognition Image Anal., 9 (1999), pp. 596-603.
[40] R. Ghrist, Barcodes: The persistent topology of data, Bull. Amer. Math. Soc. (N.S.), 45 (2008), pp. 61-75, https://doi.org/10.1090/S0273-0979-07-01191-3. · Zbl 1391.55005
[41] R. Ghrist and V. de Silva, Coordinate-free coverage in sensor networks with controlled boundaries via homology, Int. J. Robotics Res., 25 (2006), pp. 1205-1222. · Zbl 1202.94174
[42] C. Giusti, E. Pastalkova, C. Curto, and V. Itskov, Clique topology reveals intrinsic geometric structure in neural correlations, Proc. Natl. Acad. Sci., 112 (2015), pp. 13455-13460, http://www.pnas.org/content/112/44/13455. · Zbl 1355.92015
[43] S. Guindon, J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel, New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of phyml 3.0, Syst. Biol., 59 (2010), pp. 307-321, https://doi.org/10.1093/sysbio/syq010.
[44] P. R. Halmos and L. J. Savage, Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics, Ann. Math. Stat., 20 (1949), pp. 225-241, https://doi.org/10.1214/aoms/1177730032. · Zbl 0034.07502
[45] A. S. Hassan, O. G. Pybus, E. J. Sanders, J. Albert, and J. Esbjornsson, Defining HIV-1 transmission clusters based on sequence data, AIDS, 31 (2017), pp. 1211-1222.
[46] A. Hatcher, Algebraic Topology, Algebraic Topology, Cambridge University Press, Cambridge, 2002. · Zbl 1044.55001
[47] E. Hellinger, Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen., J. Reine Angew. Math., 136 (1909), pp. 210-271. · JFM 40.0393.01
[48] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl, Deep learning with topological signatures, in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Curran Associates, 2017, pp. 1634-1644, http://papers.nips.cc/paper/6761-deep-learning-with-topological-signatures.pdf.
[49] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, CD-HIT Suite: A web server for clustering and comparing biological sequences, Bioinformatics, 26 (2010), pp. 680-682.
[50] S. Hué, D. Pillay, J. P. Clewley, and O. G. Pybus, Genetic analysis reveals the complex structure of HIV-1 transmission within defined risk groups, Proc. Natl. Acad. Sci., 102 (2005), pp. 4425-4429, http://www.pnas.org/content/102/12/4425.
[51] S. Kališnik, Tropical coordinates on the space of persistence barcodes, Found. Comput. Math., (2018), pp. 1-29.
[52] K. Katoh and D. M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability, Molecular Biol. Evolution, 30 (2013), pp. 772-780, https://doi.org/10.1093/molbev/mst010.
[53] M. Kerber, D. Morozov, and A. Nigmetov, Geometry Helps to Compare Persistence Diagrams, J. Experimental Algorithmics, 22 (2017), pp. 1-20, https://doi.org/10.1145/3064175. · Zbl 1414.68129
[54] A. M. Kilpatrick, A. A. Chmura, D. W. Gibbons, R. C. Fleischer, P. P. Marra, and P. Daszak, Predicting the global spread of \(H5 N1\) avian influenza, Proc. Natl. Acad. Sci., 103 (2006), pp. 19368-19373, https://doi.org/10.1073/pnas.0609227103.
[55] B. O. Koopman, On distributions admitting a sufficient statistic, Trans. Amer. Math. Soc., 39 (1936), pp. 399-409. · JFM 62.0611.03
[56] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Stat., 22 (1951), pp. 79-86, https://projecteuclid.org:443/euclid.aoms/1177729694. · Zbl 0042.38403
[57] R. Kwitt, S. Huber, M. Niethammer, W. Lin, and U. Bauer, Statistical Topological Data Analysis – A Kernel Perspective, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., Curran Associates, 2015, pp. 3070-3078.
[58] L. Le Cam, Sufficiency and approximate sufficiency, Ann. Math. Stat., 35 (1964), pp. 1419-1455, http://www.jstor.org/stable/2238284. · Zbl 0129.11202
[59] B. Lin, A. Monod, and R. Yoshida, Tropical Foundations for Probability & Statistics on Phylogenetic Tree Space, preprint, arXiv, (2018), .
[60] M. I. Love, J. B. Hogenesch, and R. A. Irizarry, Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation, Nature Biotechnology, 34 (2016), pp. 1287-1291, http://dx.doi.org/10.1038/nbt.3682.
[61] I. Maljkovic Berry, M. C. Melendrez, T. Li, A. W. Hawksworth, G. T. Brice, P. J. Blair, E. S. Halsey, M. Williams, S. Fernandez, I.-K. Yoon, L. D. Edwards, R. Kuschner, X. Lin, S. J. Thomas, and R. G. Jarman, Frequency of influenza H3N2 intra-subtype reassortment: Attributes and implications of reassortant spread, BMC Biology, 14 (2016), p. 117, https://doi.org/10.1186/s12915-016-0337-3.
[62] N. Marshall, L. Priyamvada, Z. Ende, J. Steel, and A. C. Lowen, Influenza virus reassortment occurs with high frequency in the absence of segment mismatch, PLoS Pathog., 9 (2013), e1003421.
[63] Y. Mileyko, S. Mukherjee, and J. Harer, Probability measures on the space of persistence diagrams, Inverse Problems, 27 (2011), p. 124007. · Zbl 1247.68310
[64] J. Neyman, Su un teorema concernente le cosiddette statistiche sufficienti, Giornale Dell’Istituto Italiano degli Attuari, 6 (1935), pp. 320-334. · JFM 61.1310.01
[65] T. H. Nguyen, V. T. Than, H. D. Thanh, V.-K. Hung, D. T. Nguyen, and W. Kim, Intersubtype Reassortments of H5N1 Highly Pathogenic Avian Influenza Viruses Isolated from Quail, PLoS ONE, 11 (2016), pp. 1-15, https://doi.org/10.1371/journal.pone.0149608.
[66] N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington, A roadmap for the computation of persistent homology, EPJ Data Science, 6 (2017), p. 17, https://doi.org/10.1140/epjds/s13688-017-0109-5.
[67] L. Pachter and B. Sturmfels, Algebraic Statistics for Computational Biology, Vol. 13, Cambridge University Press, Cambridge, 2005. · Zbl 1108.62118
[68] E. Paradis, J. Claude, and K. Strimmer, APE: Analyses of Phylogenetics and Evolution in R language, Bioinformatics, 20 (2004), pp. 289-290, https://doi.org/10.1093/bioinformatics/btg412.
[69] K. Parthasarathy, Probability and mathematical statistics: A series of monographs and textbooks, in Probability Measures on Metric Spaces, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, Academic Press, 1967, ii p, https://doi.org/10.1016/B978-1-4832-0022-4.50001-6. · Zbl 0153.19101
[70] J. Á. Patin͂o-Galindo, M. Torres-Puente, M. A. Bracho, I. Alastrué, A. Juan, D. Navarro, M. J. Galindo, C. Gimeno, E. Ortega, and F. González-Candelas, Identification of a large, fast-expanding HIV-1 subtype B transmission cluster among MSM in Valencia, Spain, PLoS ONE, 12 (2017), e0171062.
[71] J. Á. Patin͂o-Galindo, M. Torres-Puente, M. A. Bracho, I. Alastrué, A. Juan, D. Navarro, M. J. Galindo, D. Ocete, E. Ortega, C. Gimeno, J. Belda, V. Domínguez, R. Moreno, and F. González-Candelas, The molecular epidemiology of HIV-1 in the Comunidad Valenciana (Spain): analysis of transmission clusters, Scientific Reports, 7 (2017), p. 11584, https://doi.org/10.1038/s41598-017-10286-1.
[72] J. A. Perea and G. Carlsson, A Klein-Bottle-Based Dictionary for Texture Representation, Int. J. Comput. Vision, 107 (2014), pp. 75-97, https://doi.org/10.1007/s11263-013-0676-2. · Zbl 1328.68279
[73] M. Pérez-Losada, M. Arenas, J. C. Galán, F. Palero, and F. González-Candelas, Recombination in viruses: Mechanisms, methods of study, and evolutionary consequences, Infection, Genetics and Evolution, 30 (2015), pp. 296-307, http://www.sciencedirect.com/science/article/pii/S156713481400478X.
[74] E. J. G. Pitman, Sufficient statistics and intrinsic accuracy, Math. Proc. Cambridge Philosoph. Soc., 32 (1936), pp. 567-579. · Zbl 0015.36201
[75] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt, A stable multi-scale kernel for topological machine learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4741-4748.
[76] A. Robinson and K. Turner, Hypothesis testing for topological data analysis, J. Appl. Comput. Topology, 1 (2017), pp. 241-261, https://doi.org/10.1007/s41468-017-0008-7. · Zbl 1396.62085
[77] F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck, MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space, Syst. Biol., 61 (2012), pp. 539-542, https://doi.org/10.1093/sysbio/sys029.
[78] N. Saitou and M. Nei, The Neighbor-Joining Method: A new method for reconstructing phylogenetic trees, Molecular Biol. Evolution, 4 (1987), pp. 406-425, https://doi.org/10.1093/oxfordjournals.molbev.a040454.
[79] D. L. Swofford, PAUP*: Phylogenetic Analysis Using Parsimony (and other methods) 4.0.b5, 2001.
[80] The Wellcome Trust Case Control Consortium, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, Nature, 447 (2007), pp. 661-6678, http://dx.doi.org/10.1038/nature05911.
[81] H. Tian, S. Zhou, L. Dong, T. P. Van Boeckel, Y. Cui, S. H. Newman, J. Y. Takekawa, D. J. Prosser, X. Xiao, Y. Wu, B. Cazelles, S. Huang, R. Yang, B. T. Grenfell, and B. Xu, Avian influenza H5N1 viral and bird migration networks in Asia, Proc. Natl. Acad. Sci., 112 (2015), pp. 172-177, http://www.pnas.org/content/112/1/172.abstract.
[82] M. C. Turchin, C. W. Chiang, C. D. Palmer, S. Sankararaman, D. Reich, J. N. Hirschhorn, and G. I. of ANthropometric Traits (GIANT) Consortium, Evidence of widespread selection on standing variation in Europe at height-associated SNPs, Nature Genetics, 44 (2012), 1015.
[83] K. Turner, Y. Mileyko, S. Mukherjee, and J. Harer, Fréchet means for distributions of persistence diagrams, Discrete Comput. Geometry, 52 (2014), pp. 44-70. · Zbl 1296.68182
[84] K. Turner, S. Mukherjee, and D. M. Boyer, Persistent homology transform for modeling shapes and surfaces, Inform. Inference, 3 (2014), pp. 310-344. · Zbl 06840289
[85] A. W. Van der Vaart, Asymptotic Statistics, Vol. 3, Cambridge University Press, Cambridge, 2000.
[86] C. J. Willer, E. K. Speliotes, R. J. Loos, S. Li, C. M. Lindgren, I. M. Heid, S. I. Berndt, A. L. Elliott, A. U. Jackson, and C. Lamina, Six new loci associated with body mass index highlight a neuronal influence on body weight regulation, Nature Genetics, 41 (2009), p. 25.
[87] M. Worobey, T. D. Watts, R. A. McKay, M. A. Suchard, T. Granade, D. E. Teuwen, B. A. Koblin, W. Heneine, P. Lemey, and H. W. Jaffe, 1970s and “Patient 0” HIV-1 genomes illuminate early HIV/AIDS history in North America, Nature, 539 (2016), pp. 98-101.
[88] S. Yan and G. Wu, Possible reason for cross-species and cross-subtype reassortment in polymerase basic protein 2 from influenza A virus, Protein and Peptide Letters, 18 (2011), pp. 434-439.
[89] J. Yang, T. Ferreira, A. P. Morris, S. E. Medland, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. N. Weedon, and R. J. Loos, Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits, Nature Genetics, 44 (2012), p. 369.
[90] Z. Yang, A space-time process model for the evolution of DNA sequences, Genetics, 139 (1995), pp. 993-1005.
[91] A. J. Zomorodian and G. Carlsson, Computing persistent homology, Discrete Comput. Geometry, 33 (2005), pp. 249-274. · Zbl 1069.55003
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.