Sequential importance sampling for multiresolution Kingman-Tajima coalescent counting. (English) Zbl 1446.62269

Summary: Statistical inference of evolutionary parameters from molecular sequence data relies on coalescent models to account for the shared genealogical ancestry of the samples. However, inferential algorithms do not scale to available data sets. A strategy to improve computational efficiency is to rely on simpler coalescent and mutation models, resulting in smaller hidden state spaces. An estimate of the cardinality of the state space of genealogical trees at different resolutions is essential to decide the best modeling strategy for a given dataset. To our knowledge, there is neither an exact nor approximate method to determine these cardinalities. We propose a sequential importance sampling algorithm to estimate the cardinality of the sample space of genealogical trees under different coalescent resolutions. Our sampling scheme proceeds sequentially across the set of combinatorial constraints imposed by the data which, in this work, are completely linked sequences of DNA at a nonrecombining segment. We analyze the cardinality of different genealogical tree spaces on simulations to study the settings that favor coarser resolutions. We apply our method to estimate the cardinality of genealogical tree spaces from mtDNA data from the 1000 genomes and a sample from a Melanesian population at the \(\beta \)-globin locus.


62P10 Applications of statistics to biology and medical sciences; meta analysis
62L10 Sequential statistical analysis
62D05 Sampling theory, sample surveys
92D20 Protein sequences, DNA sequences


BEAST; BEAUti; ape
Full Text: DOI arXiv Euclid


[1] Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 68 EP.
[2] Anderson, S., Bankier, A. T., Barrell, B. G., de Bruijn, M. H., Coulson, A. R., Drouin, J., Eperon, I. C., Nierlich, D. P., Roe, B. A. et al. (1981). Sequence and organization of the human mitochondrial genome. Nature 290 457.
[3] Andrews, R. M., Kubacka, I., Chinnery, P. F., Lightowlers, R. N., Turnbull, D. M. and Howell, N. (1999). Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 23 147.
[4] Behar, D. M., Van Oven, M., Rosset, S., Metspalu, M., Loogväli, E.-L., Silva, N. M., Kivisild, T., Torroni, A. and Villems, R. (2012). A “Copernican” reassessment of the human mitochondrial DNA tree from its root. Am. J. Hum. Genet. 90 675-684.
[5] Blanchet, J. and Rudoy, D. (2009). Rare event simulation and counting problems. In Rare Event Simulation Using Monte Carlo Methods 171-192. Wiley, Chichester. · Zbl 1175.65014
[6] Blitzstein, J. and Diaconis, P. (2010). A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Internet Math. 6 489-522. · Zbl 1238.60084
[7] Cayley, A. (1856). Note sur une formule pour la reversion des séries. J. Reine Angew. Math. 52 276-284.
[8] Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. Ann. Appl. Probab. 28 1099-1135. · Zbl 1391.65008
[9] Chen, Y. and Chen, Y. (2018). An efficient sampling algorithm for network motif detection. J. Comput. Graph. Statist. 27 503-515.
[10] Chen, Y., Diaconis, P., Holmes, S. P. and Liu, J. S. (2005). Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statist. Assoc. 100 109-120. · Zbl 1117.62310
[11] Diaconis, P. (2018). Sequential importance sampling for estimating the number of perfect matchings in bipartite graphs: An ongoing conversation with Laci. Preprint. · Zbl 1443.05156
[12] Disanto, F. and Wiehe, T. (2013). Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model. Math. Biosci. 242 195-200. · Zbl 1261.92040
[13] Drummond, A. J., Suchard, M. A., Xie, D. and Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29 1969-1973.
[14] Ferretti, L., Ledda, A., Wiehe, T., Achaz, G. and Ramos-Onsins, S. E. (2017). Decomposing the site frequency spectrum: The impact of tree topology on neutrality tests. Genetics 207 229-240.
[15] Fullerton, S. M., Harding, R. M., Boyce, A. J. and Clegg, J. B. (1994). Molecular and population genetic analysis of allelic sequence diversity at the human beta-globin locus. Proc. Natl. Acad. Sci. USA 91 1805-1809.
[16] Gao, F. and Keinan, A. (2016). Inference of super-exponential human population growth via efficient computation of the site frequency spectrum for generalized models. Genetics 202 235-245.
[17] Gattepaille, L., Günther, T. and Jakobsson, M. (2016). Inferring past effective population size from distributions of coalescent times. Genetics 204 1191-1206.
[18] Griffiths, R. C. (1987). Counting genealogical trees. J. Math. Biol. 25 423-431. · Zbl 0652.92011
[19] Griffiths, R. C. and Marjoram, P. (1997). An ancestral recombination graph. In Progress in Population Genetics and Human Evolution (Minneapolis, MN, 1994). IMA Vol. Math. Appl. 87 257-270. Springer, New York. · Zbl 0893.92020
[20] Griffiths, R. C. and Tavaré, S. (1999). The ages of mutations in gene trees. Ann. Appl. Probab. 9 567-590. · Zbl 0948.92016
[21] Gusfield, D. (1991). Efficient algorithms for inferring evolutionary trees. Networks 21 19-28. · Zbl 0719.92015
[22] Gusfield, D. (2014). ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge, MA. With contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. · Zbl 1316.92001
[23] Hammersley, J. M. and Handscomb, D. C. (1965). Monte Carlo Methods. Methuen & Co., Ltd., London; Barnes & Noble, Inc., New York.
[24] Harding, R. M., Fullerton, S. M., Griffiths, R. C., Bond, J., Cox, M. J., Schneider, J. A., Moulin, D. S. and Clegg, J. B. (1997). Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60 772-789.
[25] Jerrum, M. and Sinclair, A. (1996). The Markov chain Monte Carlo method: An approach to approximate counting and integration. In Approximation Algorithms for NP-Hard Problems 482-520.
[26] Jerrum, M. R., Valiant, L. G. and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci. 43 169-188. · Zbl 0597.68056
[27] Kimura, M. (1969). The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61 893.
[28] Kingman, J. F. C. (1982). On the genealogy of large populations. J. Appl. Probab. Special Vol. 19A 27-43. Essays in statistical science. · Zbl 0516.92011
[29] Knuth, D. E. (1976). Mathematics and computer science: Coping with finiteness. Science 194 1235-1242. · Zbl 1225.68001
[30] Knuth, D. E. (2018). Art of Computer Programming, Volume 4B, Fascicle 5: The: Mathematical Preliminaries Redux; Backtracking; Dancing Links. Addison Wesley Professional.
[31] Liu, D., Shi, W., Shi, Y., Wang, D., Xiao, H., Li, W., Bi, Y., Wu, Y., Li, X. et al. (2013). Origin and diversity of novel avian influenza A H7N9 viruses causing human infection: Phylogenetic, structural, and coalescent analyses. Lancet 381 1926-1932.
[32] Maliet, O., Gascuel, F. and Lambert, A. (2018). Ranked tree shapes, nonrandom extinctions, and the loss of phylogenetic diversity. Syst. Biol. 67 1025-1040.
[33] Nordborg, M. (1998). On the probability of Neanderthal ancestry. Am. J. Hum. Genet. 63 1237-1240.
[34] Owen, A. B. (2013). Monte Carlo theory, methods and examples. Online.
[35] Palacios, J. A. and Minin, V. N. (2013). Gaussian process-based Bayesian nonparametric inference of population size trajectories from gene genealogies. Biometrics 69 8-18. · Zbl 1274.62852
[36] Palacios, J. A., Véber, A., Cappello, L., Wang, Z., Wakeley, J. and Ramachandran, S. (2019). Bayesian estimation of population size changes by sampling Tajima’s trees. Genetics 213 967-986.
[37] Paradis, E., Claude, J. and Strimmer, K. (2004). APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20 289-290.
[38] Rosenberg, N. A. and Nordborg, M. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3 380-390.
[39] Sainudiin, R., Stadler, T. and Véber, A. (2015). Finding the best resolution for the Kingman-Tajima coalescent: Theory and applications. J. Math. Biol. 70 1207-1247. · Zbl 1342.92144
[40] Sainudiin, R. and Véber, A. (2018). Full likelihood inference from the site frequency spectrum based on the optimal tree resolution. Theor. Popul. Biol. 124 1-40. · Zbl 1405.92183
[41] Sinclair, A. (2012). Algorithms for Random Generation and Counting: A Markov Chain Approach. Springer Science & Business Media. · Zbl 0780.68096
[42] Song, S., Pursell, Z. F., Copeland, W. C., Longley, M. J., Kunkel, T. A. and Mathews, C. K. (2005). DNA precursor asymmetries in mammalian tissue mitochondria and possible contribution to mutagenesis through reduced replication fidelity. Proc. Natl. Acad. Sci. USA 102 4990-4995.
[43] Steel, M. (2016). Phylogeny—Discrete and Random Processes in Evolution. CBMS-NSF Regional Conference Series in Applied Mathematics 89. SIAM, Philadelphia, PA. · Zbl 1361.92001
[44] Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics 105 437-460.
[45] Tavaré, S. (2004). Ancestral Inference in Population Genetics. Lectures on Probability Theory and Statistics: Ecole D’Eté de Probabilités de Saint-Flour XXXI-2001. Springer. · Zbl 1062.92046
[46] Terhorst, J., Kamm, J. A. and Song, Y. S. (2017). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49 303-309.
[47] Wang, L., Bouchard-Côté, A. and Doucet, A. (2015). Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. J. Amer. Statist. Assoc. 110 1362-1374. · Zbl 1373.62555
[48] Watterson, G. · Zbl 0294.92011
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.