Clustering for multivariate continuous and discrete longitudinal data. (English) Zbl 1454.62197

Summary: Multiple outcomes, both continuous and discrete, are routinely gathered on subjects in longitudinal studies and during routine clinical follow-up in general. To motivate our work, we consider a longitudinal study of patients with primary biliary cirrhosis (PBC), with a continuous bilirubin level, a discrete platelet count and a dichotomous indication of blood vessel malformations as examples of such longitudinal outcomes. An apparent requirement is to use all the outcome values to classify the subjects into groups (e.g., groups of subjects with a similar prognosis in a clinical setting). In recent years, numerous approaches have been suggested for classification based on longitudinal (or otherwise correlated) outcomes, targeting not only traditional areas like biostatistics, but also the rapidly evolving field of bioinformatics and many others. However, most available approaches consider only continuous outcomes as a basis for classification or, if noncontinuous outcomes are considered, do not combine them with outcomes of a different nature. Here, we propose a statistical method for clustering (classification) of subjects into a prespecified number of groups with a priori unknown characteristics on the basis of repeated measurements of several longitudinal outcomes of a different nature. This method relies on a multivariate extension of the classical generalized linear mixed model in which a mixture distribution is additionally assumed for the random effects. We base the inference on a Bayesian specification of the model and simulation-based Markov chain Monte Carlo methodology. To apply the method in practice, we have prepared ready-to-use software for R (http://www.R-project.org). We also discuss evaluation of uncertainty in the classification and the use of a recently proposed methodology for model comparison (in our case, selection of the number of clusters) based on the penalized posterior deviance proposed by M. Plummer [Biostatistics 9, No. 3, 523–539 (2008; Zbl 1143.62003)].
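
As an illustration only, the model class named in the summary can be sketched as follows. The notation is assumed here and is not taken from the summary: \(Y_{i,r,j}\) is the \(j\)-th measurement of outcome \(r\) on subject \(i\), \(h_r\) is a link function chosen for outcome \(r\) (for example identity, log or logit), \(\alpha_r\) are fixed effects, \(b_i = (b_{i,1}^\top, \ldots, b_{i,R}^\top)^\top\) stacks the random effects of all \(R\) outcomes, \(K\) is the prespecified number of clusters, and \(U_i\) is the latent cluster label of subject \(i\). The choice of normal mixture components is also an assumption of this sketch.

\[
h_r\bigl(\mathrm{E}[Y_{i,r,j} \mid b_{i,r}]\bigr) = x_{i,r,j}^\top \alpha_r + z_{i,r,j}^\top b_{i,r}, \qquad r = 1, \ldots, R,
\]
\[
b_i \sim \sum_{k=1}^{K} w_k \, \mathcal{N}(\mu_k, D_k), \qquad w_k > 0, \quad \sum_{k=1}^{K} w_k = 1,
\]
\[
\hat{p}_{i,k} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{P}\bigl(U_i = k \mid \theta^{(m)}, y_i\bigr), \qquad k = 1, \ldots, K.
\]

Under these assumptions, \(\theta^{(1)}, \ldots, \theta^{(M)}\) are MCMC draws of the model parameters and \(\hat{p}_{i,k}\) estimates the posterior probability that subject \(i\) belongs to cluster \(k\). A subject would typically be assigned to the cluster with the largest \(\hat{p}_{i,k}\), and probabilities that are close to each other indicate an uncertain classification.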


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F15 Bayesian inference
62R10 Functional data analysis
62P10 Applications of statistics to biology and medical sciences; meta analysis
62-08 Computational methods for problems pertaining to statistics




[1] Benaglia, T., Chauveau, D., Hunter, D. R. and Young, D. (2009). Mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software 32 1-29.
[2] Booth, J. G., Casella, G. and Hobert, J. P. (2008). Clustering using objective functions and stochastic search. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 119-139. · Zbl 1400.62128
[3] Cabral, C. R. B., Lachos, V. H. and Madruga, M. R. (2012). Bayesian analysis of skew-normal independent linear mixed models with heterogeneity in the random-effects population. J. Statist. Plann. Inference 142 181-200. · Zbl 1229.62026
[4] Celeux, G., Martin, O. and Lavergne, C. (2005). Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments. Stat. Model. 5 243-267. · Zbl 1111.62103
[5] Celeux, G., Forbes, F., Robert, C. P. and Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Anal. 1 651-673 (electronic). · Zbl 1331.62329
[6] Dasgupta, A. and Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. J. Amer. Statist. Assoc. 93 294-302. · Zbl 0906.62105
[7] De la Cruz-Mesía, R., Quintana, F. A. and Marshall, G. (2008). Model-based clustering for longitudinal data. Comput. Statist. Data Anal. 52 1441-1457. · Zbl 1452.62454
[8] De la Fé Rodríguez, P. Y., Coddens, A., Del Fava, E., Abrahantes, J. C., Shkedy, Z., Martin, L. O. M., Muñoz, E. C., Duchateau, L., Cox, E. and Goddeeris, B. M. (2011). High prevalence of F4+ and F18+ Escherichia coli in Cuban piggeries as determined by serological survey. Tropical Animal Health and Production 43 937-946.
[9] Dickson, E. R., Grambsch, P. M., Fleming, T. R., Fisher, L. D. and Langworthy, A. (1989). Prognosis in primary biliary cirrhosis: Model for decision making. Hepatology 10 1-7.
[10] Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. Wiley, New York. · Zbl 0727.62096
[11] Fong, Y., Rue, H. and Wakefield, J. (2010). Bayesian inference for generalized linear mixed models. Biostatistics 11 397-412.
[12] Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611-631. · Zbl 1073.62545
[13] Fraley, C. and Raftery, A. E. (2006). MCLUST version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, Dept. Statistics, Univ. Washington.
[14] Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1995). Efficient parameterisations for normal linear mixed models. Biometrika 82 479-488. · Zbl 0832.62064
[15] Grün, B. and Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in \(\mathsf{R}\). Comput. Statist. Data Anal. 51 5247-5252. · Zbl 1445.62192
[16] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. · Zbl 1273.62005
[17] Hennig, C. (2004). Breakdown points for maximum likelihood estimators of location-scale mixtures. Ann. Statist. 32 1313-1340. · Zbl 1047.62063
[18] James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. J. Amer. Statist. Assoc. 98 397-408. · Zbl 1041.62052
[19] Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis, 6th ed. Pearson Prentice Hall, Upper Saddle River, NJ. · Zbl 1269.62044
[20] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773-795. · Zbl 0846.62028
[21] Komárek, A. (2009). A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the number of components and interval-censored data. Comput. Statist. Data Anal. 53 3932-3947. · Zbl 1453.62020
[22] Komárek, A. and Komárková, L. (2013). Supplement to “Clustering for multivariate continuous and discrete longitudinal data.” . · Zbl 1454.62197
[23] Komárek, A., Hansen, B. E., Kuiper, E. M. M., van Buuren, H. R. and Lesaffre, E. (2010). Discriminant analysis using a multivariate linear mixed model with a normal mixture in the random effects distribution. Stat. Med. 29 3267-3283.
[24] Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38 963-974. · Zbl 0512.62107
[25] Liu, X. and Yang, M. C. K. (2009). Simultaneous curve registration and clustering for functional data. Comput. Statist. Data Anal. 53 1361-1376. · Zbl 1452.62993
[26] Ma, P., Castillo-Davis, C. I., Zhong, W. and Liu, J. S. (2006). A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 34 1261-1269.
[27] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Statistics: Textbooks and Monographs 84. Dekker, New York. · Zbl 0697.62050
[28] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York. · Zbl 0963.62061
[29] Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer, New York. · Zbl 1093.62002
[30] Newton, M. A. and Chung, L. M. (2010). Gamma-based clustering via ordered means with application to gene-expression analysis. Ann. Statist. 38 3217-3244. · Zbl 1233.62002
[31] Peng, J. and Müller, H.-G. (2008). Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions. Ann. Appl. Stat. 2 1056-1077. · Zbl 1149.62053
[32] Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics 9 523-539. · Zbl 1143.62003
[33] Qin, L.-X. and Self, S. G. (2006). The clustering of regression models method with applications in gene expression data. Biometrics 62 526-533. · Zbl 1097.62134
[34] Quandt, R. E. and Ramsey, J. B. (1978). Estimating mixtures of normal distributions and switching regressions. J. Amer. Statist. Assoc. 73 730-738. · Zbl 0401.62024
[35] R Development Core Team. (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Available at .
[36] Ramoni, M. F., Sebastiani, P. and Kohane, I. S. (2002). Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA 99 9121-9126 (electronic). · Zbl 1023.62110
[37] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59 731-792. · Zbl 0891.62020
[38] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005
[39] Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 583-639. · Zbl 1067.62010
[40] Spiessens, B., Verbeke, G. and Komárek, A. (2002). A SAS-macro for the classification of longitudinal profiles using mixtures of normal distributions in nonlinear and generalised linear mixed models. Technical Report, Biostatistical Center, Catholic Univ. Leuven, Leuven.
[41] Stephens, M. (2000). Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 795-809. · Zbl 0957.62020
[42] Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester. · Zbl 0646.62013
[43] Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. J. Amer. Statist. Assoc. 91 217-221. · Zbl 0870.62057
[44] Villarroel, L., Marshall, G. and Barón, A. E. (2009). Cluster analysis using multivariate mixed effects models. Stat. Med. 28 2552-2565.
[45] Witten, D. M. (2011). Classification and clustering of sequencing data using a Poisson model. Ann. Appl. Stat. 5 2493-2518. · Zbl 1234.62150