Clustering South African households based on their asset status using latent variable models. (English) Zbl 1454.62503

Summary: The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status.
A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. Binary and ordinal items are modeled with item response theory models. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure; this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight into the different socio-economic strata within the Agincourt region.
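
The shared latent variable structure described in the summary can be illustrated with a small generative sketch. It is not the authors' implementation and says nothing about the MCMC estimation. All function names, parameter values, the one-dimensional latent trait, and the choice to let clusters differ only in the mean of that trait are assumptions made for illustration. A binary or ordinal response is obtained by thresholding one latent Gaussian value, and a nominal response with K levels is obtained from K-1 latent values in the manner of a multinomial probit model.

```python
import random


def ordinal_level(z, cutpoints):
    """Map a latent value z to a level in 1..K using K-1 increasing cutpoints.

    A binary item is the special case with a single cutpoint.
    """
    return 1 + sum(z > c for c in cutpoints)


def nominal_level(z):
    """Multinomial-probit-style rule for a nominal item with K levels.

    z holds K-1 latent values. Level 1 (the reference level) is chosen when
    every latent value is negative; otherwise the level belonging to the
    largest latent value is chosen.
    """
    k = max(range(len(z)), key=lambda j: z[j])
    return 1 if z[k] < 0 else k + 2


def simulate_household(rng, weights, cluster_means, ord_items, nom_items):
    """Draw one household from a hypothetical mixture of hybrid latent models.

    ord_items: list of (intercept, loading, cutpoints) per binary/ordinal item.
    nom_items: list of (intercepts, loadings) per nominal item, each of
               length K-1.
    Assumption: clusters differ only in the mean of a 1-D latent trait.
    """
    # Pick the cluster label according to the mixing weights.
    g = rng.choices(range(len(weights)), weights=weights)[0]
    # Latent trait for this household.
    theta = rng.gauss(cluster_means[g], 1.0)
    responses = []
    for intercept, loading, cutpoints in ord_items:
        z = intercept + loading * theta + rng.gauss(0.0, 1.0)
        responses.append(ordinal_level(z, cutpoints))
    for intercepts, loadings in nom_items:
        z = [a + b * theta + rng.gauss(0.0, 1.0)
             for a, b in zip(intercepts, loadings)]
        responses.append(nominal_level(z))
    return g, responses


if __name__ == "__main__":
    rng = random.Random(1)
    ord_items = [(0.0, 1.0, [0.0]),          # binary item
                 (0.0, 1.5, [-1.0, 1.0])]    # three-level ordinal item
    nom_items = [([0.2, -0.2], [1.0, -1.0])]  # three-level nominal item
    for _ in range(5):
        print(simulate_household(rng, [0.5, 0.5], [-1.5, 1.5],
                                 ord_items, nom_items))
```

Because every item type is driven by Gaussian latent values that depend linearly on the same trait, the item types can be combined in one model, which is the similarity the summary says the hybrid model exploits.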

MSC:

62P25 Applications of statistics to social sciences
62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P15 Applications of statistics to psychology

Software:

MULTIMIX; bfa; PRMLT; BayesDA; PGMM