Power laws for family sizes in a duplication model. (English) Zbl 1099.92055

Diversification of protein folds in a genome is formulated as a multitype Yule process, starting with one individual. Each individual gives birth to a new individual at rate 1. When a new individual is born, it has the same type as its parent with probability \(1-r\) and is a new type, different from all previously observed types, with probability \(r\). Individuals of the same type are called families. Based on the connections between Yule processes and Pólya urns, an approximation is derived for the joint family-size distribution when the population size reaches \(N\). It is shown that if \(1\ll S\ll N^{1-r}\), then the number of families of size at least \(S\) is approximately \(\text{CNS}^{-1(1-r)}\). The limiting distribution of the size of the large families is derived and it is shown that if \(N^{1-r}\ll S\), the expected number of families of size at least \(S\) decays more rapidly than any power. It is shown that the model has a close relation to the ‘Chinese restaurant process’. Connections to preferential attachment models are discussed.


92D15 Problems related to evolution
60J80 Branching processes (Galton-Watson, birth-and-death, etc.)
60J85 Applications of branching processes
92D20 Protein sequences, DNA sequences
Full Text: DOI arXiv


[1] Aldous, D. J. (2001). Stochastic models and descriptive statistics for phylogenetic trees from Yule to today. Statist. Sci. 16 23–34. · Zbl 1127.60313
[2] Angerer, W. P. (2001). An explicit representation of the Luria–Delbrück distribution. J. Math. Biol. 42 145–174. · Zbl 1009.92028
[3] Angerer, W. P. and Wakolbinger, A. (2005). In preparation.
[4] Arratia, R., Barbour, A. D. and Tavaré, S. (2000). Limits of logarithmic combinatorial structures. Ann. Probab. 28 1620–1644. · Zbl 1044.60003
[5] Arratia, R. and Gordon, L. (1989). Tutorial on large deviations for the binomial distribution. Bull. Math. Biol. 51 125–131. · Zbl 0663.60028
[6] Athreya, K. B. and Karlin, S. (1968). Embedding of urn schemes into continuous time Markov branching processes. Ann. Math. Statist. 39 1801–1817. · Zbl 0185.46103
[7] Athreya, K. B. and Ney, P. E. (1972). Branching Processes . Springer, Berlin. · Zbl 0259.60002
[8] Barbási, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286 509–512. · Zbl 1226.05223
[9] Berger, N., Borgs, C., Chayes, J. and Saberi, A. (2005). On the spread of viruses on the internet. In Proceedings of the 16th ACM—SIAM Symposium on Discrete Algorithms 301–310. SIAM, Philadelphia, PA. · Zbl 1297.68029
[10] Bollobás, B., Borgs, C., Chayes, J. and Riordan, O. (2003). Directed scale free graphs. In Proceedings of the 14th ACM—SIAM Symposium on Discrete Algorithms 132–139. SIAM, Philadelphia, PA. · Zbl 1094.68605
[11] Bollobás, B., Riordan, O., Spencer, J. and Tusnády, G. (2001). The degree sequence of a scale-free random graph process. Random Structures Algorithms 18 279–290. · Zbl 0985.05047
[12] Cooper, C. and Frieze, A. (2003). A general model for web graphs. Random Structures Algorithms 22 311–335. · Zbl 1018.60007
[13] Durrett, R. (1996). Probability : Theory and Examples , 2nd ed. Duxbury, Belmont, CA. · Zbl 1202.60001
[14] Durrett, R. (2002). Probability Models for DNA Sequence Evolution . Springer, New York. · Zbl 0991.92021
[15] Durrett, R. and Schweinsberg, J. (2004). Approximating selective sweeps. Theoret. Population Biol. 66 129–138. · Zbl 1111.92042
[16] Ewens, W. J. (1972). The sampling theory of selectively neutral alleles. Theoret. Population Biol. 3 87–112. · Zbl 0245.92009
[17] Gu, Z., Cavalcanti, A., Chen, F.-C., Bouman, P. and Li, W.-H. (2002). Extent of gene duplication in the genomes of Drosophila , nematode, and yeast. Mol. Biol. Evol. 19 256–262.
[18] Harrison, P. M. and Gerstein, M. (2002). Studying genomes through the aeons: Protein families, pseudogenes, and proteome evolution. J. Mol. Biol. 318 1155–1174.
[19] Huynen, M. A. and van Nimwegen, E. (1998). The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 15 583–589.
[20] Janson, S. (2004). Functional limit theorems for multitype branching processes and generalized Pólya urns. Stochastic Process. Appl. 110 177–245. · Zbl 1075.60109
[21] Janson, S. (2004). Limit theorems for triangular urn schemes. Preprint. Available at http://www. math.uu.se/ svante/papers/index.html. · Zbl 1112.60012
[22] Johnson, N. L., Kotz, S. and Kemp, A. W. (1992). Univariate Discrete Distributions . Wiley, New York. · Zbl 0773.62007
[23] Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezov, F. S. and Koonin, E. V. (2002). Birth and death of protein domains: A simple model explains power law behavior. BMC Evolutionary Biology 2 article 18.
[24] Kingman, J. F. C. (1982). The coalescent. Stochastic Process. Appl. 13 235–248. · Zbl 0491.60076
[25] Koonin, E. V., Wolf, Y. I. and Karev, G. P. (2002). The structure of the protein universe and genome evolution. Nature 420 218–223.
[26] Krapivsky, P. L., Redner S. and Leyvraz, F. (2000). Connectivity of growing random networks. Phys. Rev. Lett. 85 4629–4632.
[27] Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. and Upfal, E. (2000). Stochastic models for the web graph. In Proceedings of the 41st IEEE Symposium on the Foundations of Computer Science 57–65.
[28] Li, W.-H., Gu, Z., Wang, H. and Nekrutenko, A. (2001). Evolutionary analyses of the human genome. Nature 409 847–849. · Zbl 1019.81044
[29] Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probab. Theory Related Fields 102 145–158. · Zbl 0821.60047
[30] Pitman, J. (2002). Combinatorial stochastic processes. Lecture Notes for St. Flour Summer School. Available at http://stat-www.berkeley.edu/users/pitman/bibliog.html.
[31] Pitman, J. and Yor, M. (1997). The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25 855–900. · Zbl 0880.60076
[32] Qian, J., Luscombe, N. M. and Gerstein, M. (2001). Protein family and fold occurrence in genomes: Power-law behavior and evolutionary model. J. Mol. Biol. 313 673–681.
[33] Rzhetsky, A. and Gomez, S. M. (2001). Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17 988–996.
[34] Schweinsberg, J. and Durrett, R. (2004). Random partitions approximating the coalescence of lineages during a selective sweep. Preprint. Available at http://front.math.ucdavis.edu/math.PR/ 0411069. · Zbl 1073.92029
[35] Simon, H. A. (1955). On a class of skew distribution functions. Biometrika 42 425–440. · Zbl 0066.11201
[36] Skorohod, A. V. (1961). Asymptotic formulas for stable distribution laws. Selected Translations in Mathematical Statistics and Probability 1 157–161. · Zbl 0112.10107
[37] Yule, G. U. (1925). A mathematical theory of evolution based on the conclusions of Dr. J. C. Willis. Philos. Trans. Roy. Soc. London Ser. B 213 21–87.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.