zbMATH — the first resource for mathematics

Loglinear model selection and human mobility. (English) Zbl 1405.62065
Summary: Methods for selecting loglinear models were among Steve Fienberg’s research interests since the start of his long and fruitful career. After we dwell upon the string of papers focusing on loglinear models that can be partly attributed to Steve’s contributions and influential ideas, we develop a new algorithm for selecting graphical loglinear models that is suitable for analyzing hyper-sparse contingency tables. We show how multi-way contingency tables can be used to represent patterns of human mobility. We analyze a dataset of geolocated tweets from South Africa that comprises 46 million latitude/longitude locations of 476,601 Twitter users that is summarized as a contingency table with 214 variables.

MSC:
62H17 Contingency tables
62F15 Bayesian inference
62P25 Applications of statistics to social sciences
Software:
BDgraph; GitHub; HdBCS; smappR