Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting. (English) Zbl 1283.92027

Summary: Continuous-time linear birth-death-immigration (BDI) processes are frequently used in ecology and epidemiology to model the stochastic dynamics of the population of interest. In clinical settings, multiple birth-death processes can describe disease trajectories of individual patients, allowing for estimation of the effects of individual covariates on the birth and death rates of the process. Such estimation is usually accomplished by analyzing patient data collected at unevenly spaced time points, referred to as panel data in the biostatistics literature. Fitting linear BDI processes to panel data is a nontrivial optimization problem because birth and death rates can be functions of many parameters related to the covariates of interest.
We propose a novel expectation-maximization (EM) algorithm for fitting linear BDI models with covariates to panel data. We derive a closed-form expression for the joint generating function of some of the BDI process statistics and use this generating function to reduce the E-step of the EM algorithm, as well as calculation of the Fisher information, to one-dimensional integration. This analytical technique yields a computationally efficient and robust optimization algorithm that we implemented in an open-source R package.
We apply our method to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of tuberculosis, to study intrapatient time evolution of IS6110 copy number, a genetic marker frequently used during estimation of epidemiological clusters of Mycobacterium tuberculosis infections. Our analysis reveals previously undocumented differences in IS6110 birth-death rates among three major lineages of Mycobacterium tuberculosis, which has important implications for epidemiologists that use IS6110 for DNA fingerprinting of Mycobacterium tuberculosis.


92C40 Biochemistry, molecular biology
92C60 Medical epidemiology
60J85 Applications of branching processes
92-04 Software, source code, etc. for problems pertaining to biology
92-08 Computational methods for problems pertaining to biology
62P10 Applications of statistics to biology and medical sciences; meta analysis


R; msm
Full Text: DOI arXiv Euclid


[1] Alonso, H., Aguilo, J. I., Samper, S., Caminero, J. A., Campos-Herrero, M. I., Gicquel, B., Brosch, R., Martín, C. and Otal, I. (2011). Deciphering the role of IS6110 in a highly transmissible Mycobacterium tuberculosis Beijing strain, GC1237. Tuberculosis 91 117-126.
[2] Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164-171. · Zbl 0188.49603
[3] Cattamanchi, A., Hopewell, P. C., Gonzalez, L. C., Osmond, D. H., Masae, Kawamura, L., Daley, C. L. and Jasmer, R. M. (2006). A 13-year molecular epidemiological analysis of tuberculosis in San Francisco. The International Journal of Tuberculosis and Lung Disease 10 297-304.
[4] Crespi, C. M., Cumberland, W. G. and Blower, S. (2005). A queueing model for chronic recurrent conditions under panel observation. Biometrics 61 193-198. · Zbl 1077.62105
[5] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1-38. · Zbl 0364.62022
[6] Dorman, K. S., Sinsheimer, J. S. and Lange, K. (2004). In the garden of branching processes. SIAM Rev. 46 202-229 (electronic). · Zbl 1069.60072
[7] Doss, C. R., Suchard, M. A., Holmes, I., Kato-Maeda, M. and Minin, V. N. (2013). Supplement to “Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting.” . · Zbl 1283.92027
[8] Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, B. C., Narayanan, S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, M. C., Hilty, M., Hopewell, P. C. and Small, P. M. (2006). Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. USA 103 2869-2873.
[9] Gibson, G. J. and Renshaw, E. (1998). Estimating parameters in stochastic compartmental models using Markov chain methods. IMA Journal of Mathematics Applied in Medicine & Biology 15 19-40. · Zbl 0916.92024
[10] Golinelli, D. (2000). Bayesian inference in hidden stochastic population processes. Ph.D. thesis, Univ. Washington, Seattle, WA.
[11] Guttorp, P. (1995). Stochastic Modeling of Scientific Data . Chapman & Hall, London. · Zbl 0862.60034
[12] Henrici, P. (1979). Fast Fourier methods in computational complex analysis. SIAM Rev. 21 481-527. · Zbl 0416.65022
[13] Holmes, I. (2005). Using evolutionary expectation maximization to estimate indel rates. Bioinformatics 21 2294-2300.
[14] Holmes, I. and Rubin, G. M. (2002). An expectation maximization algorithm for training hidden substitution models. Journal of Molecular Biology 317 753-764.
[15] Jackson, C. H. (2011). Multi-state models for panel data: The msm package for R. Journal of Statistical Software 38 1-29.
[16] Jasmer, R. M., Hahn, J. A., Small, P. M., Daley, C. L., Behr, M. A., Moss, A. R., Creasman, J. M., Schecter, G. F., Paz, E. A. and Hopewell, P. C. (1999). A molecular epidemiologic analysis of tuberculosis trends in San Francisco, 1991-1997. Annals of Internal Medicine 130 971-978.
[17] Kalbfleisch, J. D. and Lawless, J. F. (1985). The analysis of panel data under a Markov assumption. J. Amer. Statist. Assoc. 80 863-871. · Zbl 0586.62136
[18] Karlin, S. and McGregor, J. (1958). Linear growth birth and death processes. J. Math. Mech. 7 643-662. · Zbl 0091.13804
[19] Kato-Maeda, M., Metcalfe, J. Z. and Flores, L. (2011). Genotyping of Mycobacterium tuberculosis: Application in epidemiologic studies. Future Microbiol. 6 203-216.
[20] Keiding, N. (1975). Maximum likelihood estimation in the birth-and-death process. Ann. Statist. 3 363-372. · Zbl 0302.62043
[21] Kendall, D. G. (1948). On the generalized “birth-and-death” process. Ann. Math. Statist. 19 1-15. · Zbl 0032.17604
[22] Lange, K. (1982). Calculation of the equilibrium distribution for a deleterious gene by the finite Fourier transform. Biometrics 38 79-86. · Zbl 0479.62086
[23] Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 425-437. · Zbl 0813.62021
[24] Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 44 226-233. · Zbl 0488.62018
[25] McEvoy, C. R. E., Falmer, A. A., van Pittius, N. C. G., Victor, T. C., van Helden, P. D. and Warren, R. M. (2007). The role of IS6110 in the evolution of Mycobacterium tuberculosis. Tuberculosis 87 393-404.
[26] Minin, V. N. and Suchard, M. A. (2008). Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. 56 391-412. · Zbl 1145.60323
[27] Nee, S. (2006). Birth-death models in macroevolution. Annual Review of Ecology , Evolution , and Systematics 37 1-17.
[28] Roberts, W. J. J. and Ephraim, Y. (2008). An EM algorithm for ion-channel current estimation. IEEE Trans. Signal Process. 56 26-33. · Zbl 1390.92031
[29] Rosenberg, N. A., Tsolaki, A. G. and Tanaka, M. M. (2003). Estimating change rates of genetic markers using serial samples: Applications to the transposon IS6110 in Mycobacterium tuberculosis. Theoretical Population Biology 63 347-363. · Zbl 1098.62147
[30] Sehl, M., Zhou, H., Sinsheimer, J. S. and Lange, K. L. (2011). Extinction models for cancer stem cell therapy. Math. Biosci. 234 132-146. · Zbl 1256.92028
[31] Small, P. M., Hopewell, P. C., Singh, S. P., Paz, A., Parsonnet, J., Ruston, D. C., Schecter, G. F., Daley, C. L. and Schoolnik, G. K. (1994). The epidemiology of tuberculosis in San Francisco. A population-based study using conventional and molecular methods. New England Journal of Medicine 330 1703-1709.
[32] Suchard, M. A., Lange, K. and Sinsheimer, J. S. (2008). Efficiency of protein production from mRNA. J. Stat. Theory Pract. 2 173-182.
[33] Tanaka, M. M. and Rosenberg, N. A. (2001). Optimal estimation of transposition rates of insertion sequences for molecular epidemiology. Stat. Med. 20 2409-2420.
[34] Thorne, J. L., Kishino, H. and Felsenstein, J. (1991). An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33 114-124.
[35] van Embden, J. D., Cave, M. D., Crawford, J. T., Dale, J. W., Eisenach, K. D., Gicquel, B., Hermans, P., Martin, C., McAdam, R., Shinnick, T. M. et al. (1993). Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: Recommendations for a standardized methodology. J. Clin. Microbiol. 31 406-409.
[36] Warren, R. M., van der Spuy, G. D., Richardson, M., Beyers, N., Booysen, C., Behr, M. A. and van Helden, P. D. (2002). Evolution of the IS6110-based restriction fragment length polymorphism pattern during the transmission of Mycobacterium tuberculosis. J. Clin. Microbiol. 40 1277-1282.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.