Model-based time-varying clustering of multivariate longitudinal data with covariates and outliers. (English) Zbl 1464.62128

Summary: A class of multivariate linear models under the longitudinal setting, in which unobserved heterogeneity may evolve over time, is introduced. A latent structure is considered to model heterogeneity, having a discrete support and following a first-order Markov chain. Heavy-tailed multivariate distributions are introduced to deal with outliers. Maximum likelihood estimation is performed to estimate parameters by using expectation-maximization and expectation-conditional-maximization algorithms. Notes on model identifiability and robustness are provided, along with all computational details needed to implement the proposal. Three applications on artificial and real data are illustrated. These focus on the potential effects of outliers on clustering and their identification.
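As a rough illustration of the model class described in the summary, the likelihood contribution of one unit's sequence under a first-order latent Markov chain can be evaluated with a scaled forward recursion. The sketch below is not taken from the paper: it assumes multivariate \(t\) emissions as one example of a heavy-tailed family, a state-specific linear predictor, and hypothetical names (`log_mvt_pdf`, `forward_loglik`, `betas`, `scales`, `dfs`).

```python
import math
import numpy as np

def log_mvt_pdf(y, mean, scale, df):
    """Log-density of a p-variate Student t (assumed heavy-tailed family)."""
    p = y.shape[0]
    diff = y - mean
    maha = float(diff @ np.linalg.solve(scale, diff))  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(scale)
    return (math.lgamma((df + p) / 2) - math.lgamma(df / 2)
            - 0.5 * p * math.log(df * math.pi) - 0.5 * logdet
            - 0.5 * (df + p) * math.log1p(maha / df))

def forward_loglik(Y, X, init, trans, betas, scales, dfs):
    """Scaled forward recursion for one unit.

    Y: (T, p) responses; X: (T, q) covariates; init: (K,) initial probabilities;
    trans: (K, K) transition matrix; betas: K coefficient matrices of shape (q, p);
    scales: K scale matrices of shape (p, p); dfs: K degrees of freedom.
    Returns the log-likelihood and the filtered state probabilities at time T.
    """
    T, K = Y.shape[0], init.shape[0]
    loglik, alpha = 0.0, None
    for t in range(T):
        # State-specific emission log-densities with a linear-regression mean.
        log_b = np.array([log_mvt_pdf(Y[t], X[t] @ betas[k], scales[k], dfs[k])
                          for k in range(K)])
        shift = log_b.max()                      # guard against underflow
        prior = init if t == 0 else alpha @ trans  # predicted state probabilities
        unnorm = prior * np.exp(log_b - shift)
        c = unnorm.sum()
        alpha = unnorm / c                       # filtered state probabilities
        loglik += math.log(c) + shift
    return loglik, alpha
```

In an EM-type scheme this recursion would be paired with a backward pass to obtain the posterior state probabilities used in the E-step; under the \(t\) assumption, observations with a large Mahalanobis distance would receive small weights in the M-step, which is how such models limit the influence of outliers.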

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62M05 Markov processes: estimation; hidden Markov models

Software:

flexmix; NHMSAR; R

References:

[1] Ailliot, P.; Bessac, J.; Monbet, V.; Pene, F., Non-homogeneous hidden Markov-switching models for wind time series, J. Statist. Plann. Inference, 160, 75-88, (2015) · Zbl 1311.62189
[2] Bagnato, L.; Greselin, F.; Punzo, A., On the spectral decomposition in normal discriminant analysis, Comm. Statist. Simulation Comput., 43, 6, 1471-1489, (2014) · Zbl 1333.62056
[3] Bagnato, L.; Punzo, A., Finite mixtures of unimodal beta and gamma densities and the \(k\)-bumps algorithm, Comput. Statist., 28, 4, 1571-1597, (2013) · Zbl 1306.65024
[4] Bai, X.; Chen, K.; Yao, W., Mixture of linear mixed models using multivariate \(t\) distribution, J. Stat. Comput. Simul., 86, 4, 771-787, (2016)
[5] Bai, X.; Yao, W.; Boyer, J. E., Robust fitting of mixture regression models, Comput. Statist. Data Anal., 56, 7, 2347-2359, (2012) · Zbl 1252.62011
[6] Bartolucci, F.; Farcomeni, A., A multivariate extension of the dynamic logit model for longitudinal data based on a latent Markov heterogeneity structure, J. Amer. Statist. Assoc., 104, 486, 816-831, (2009) · Zbl 1388.62158
[7] Bartolucci, F.; Farcomeni, A., A discrete time event-history approach to informative drop-out in mixed latent Markov models with covariates, Biometrics, 71, 1, 80-89, (2015) · Zbl 1419.62308
[8] Bartolucci, F.; Farcomeni, A.; Pennoni, F., Latent Markov models for longitudinal data, (2013), CRC Press · Zbl 1341.62002
[9] Baum, L. E.; Petrie, T.; Soules, G.; Weiss, N., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist., 41, 1, 164-171, (1970) · Zbl 0188.49603
[10] Berkane, M.; Bentler, P. M., Estimation of contamination parameters and identification of outliers in multivariate data, Sociol. Methods Res., 17, 1, 55-64, (1988)
[11] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans. Pattern Anal. Mach. Intell., 22, 7, 719-725, (2000)
[12] Biernacki, C.; Celeux, G.; Govaert, G., Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models, Comput. Statist. Data Anal., 41, 3-4, 561-575, (2003) · Zbl 1429.62235
[13] Browne, R. P.; McNicholas, P. D., A mixture of generalized hyperbolic distributions, Canad. J. Statist., 43, 2, 176-198, (2015) · Zbl 1320.62144
[14] Bulla, J.; Berzel, A., Computational issues in parameter estimation for stationary hidden Markov models, Comput. Statist., 23, 1, 1-18, (2008)
[15] Campbell, N. A.; Mahon, R. J., A multivariate study of variation in two species of rock crab of genus Leptograpsus, Aust. J. Zool., 22, 3, 417-425, (1974)
[16] Crawford, S. L., An application of the Laplace method to finite mixture distributions, J. Amer. Statist. Assoc., 89, 425, 259-267, (1994) · Zbl 0795.62022
[17] Dannemann, J.; Holzmann, H.; Leister, A., Semiparametric hidden Markov models: identifiability and estimation, Wiley Interdiscip. Rev. Comput. Stat., 6, 6, 418-425, (2014)
[18] Dempster, A.; Laird, N.; Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Stat. Methodol., 39, 1, 1-38, (1977) · Zbl 0364.62022
[19] Dickson, E. R.; Grambsch, P. M.; Fleming, T. R.; Fisher, L. D.; Langworthy, A., Prognosis in primary biliary-cirrhosis: model for decision-making, Hepatology, 10, 1-7, (1989)
[20] Farcomeni, A., Quantile regression for longitudinal data based on latent Markov subject-specific parameters, Stat. Comput., 22, 1, 141-152, (2012) · Zbl 1322.62206
[21] Farcomeni, A.; Greco, L., S-estimation of hidden Markov models, Comput. Statist., 30, 1, 57-80, (2015) · Zbl 1342.65032
[22] Franczak, B. C.; Browne, R. P.; McNicholas, P. D., Mixtures of shifted asymmetric Laplace distributions, IEEE Trans. Pattern Anal. Mach. Intell., 36, 6, 1149-1157, (2014)
[23] Frühwirth-Schnatter, S., Finite mixture and Markov switching models, (2006), Springer New York · Zbl 1108.62002
[24] Frühwirth-Schnatter, S., Panel data analysis: a survey on model-based clustering of time series, Adv. Data Anal. Classif., 5, 4, 251-280, (2011) · Zbl 1274.62591
[25] Frühwirth-Schnatter, S.; Kaufmann, S., Model-based clustering of multiple time series, J. Bus. Econom. Statist., 26, 1, 78-89, (2008)
[26] García-Escudero, L. A.; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A review of robust clustering methods, Adv. Data Anal. Classif., 4, 2, 89-109, (2010) · Zbl 1284.62375
[27] García-Escudero, L. A.; Gordaliza, A.; Mayo-Iscar, A.; San Martín, R., Robust clusterwise linear regression through trimming, Comput. Statist. Data Anal., 54, 12, 3057-3069, (2010) · Zbl 1284.62198
[28] Goldfeld, S. M.; Quandt, R. E., A Markov model for switching regressions, J. Econometrics, 1, 1, 3-15, (1973) · Zbl 0294.62087
[29] Greselin, F.; Ingrassia, S.; Punzo, A., Assessing the pattern of covariance matrices via an augmentation multiple testing procedure, Stat. Methods Appl., 20, 2, 141-170, (2011) · Zbl 1232.62090
[30] Greselin, F.; Punzo, A., Closed likelihood ratio testing procedures to assess similarity of covariance matrices, Amer. Statist., 67, 3, 117-128, (2013)
[31] Grün, B.; Leisch, F., Finite mixtures of generalized linear regression models, (Recent Advances in Linear Models and Related Areas: Essays in Honour of Helge Toutenburg, (2008), Physica-Verlag HD Heidelberg), 205-230, (Chapter)
[32] Hamilton, J. D., Analysis of time series subject to changes in regime, J. Econometrics, 45, 1-2, 39-70, (1990) · Zbl 0723.62050
[33] Hartigan, J. A.; Hartigan, P. M., The dip test of unimodality, Ann. Statist., 13, 1, 70-84, (1985) · Zbl 0575.62045
[34] Hennig, C., Identifiability of models for clusterwise linear regression, J. Classification, 17, 2, 273-296, (2000) · Zbl 1017.62058
[35] Holzmann, H.; Munk, A.; Gneiting, T., Identifiability of finite mixtures of elliptical distributions, Scand. J. Statist., 33, 4, 753-763, (2006) · Zbl 1164.62354
[36] Ingrassia, S.; Minotti, S. C.; Punzo, A., Model-based clustering via linear cluster-weighted models, Comput. Statist. Data Anal., 71, 159-182, (2014)
[37] Ingrassia, S.; Punzo, A., Decision boundaries for mixtures of regressions, J. Korean Statist. Soc., 45, 2, 295-306, (2016) · Zbl 1341.62181
[38] Ingrassia, S.; Punzo, A.; Vittadini, G.; Minotti, S. C., The generalized linear mixed cluster-weighted model, J. Classification, 32, 1, 85-113, (2015) · Zbl 1331.62310
[39] Juárez, M. A.; Steel, M. F.J., Model-based clustering of non-Gaussian panel data based on skew-\(t\) distributions, J. Bus. Econom. Statist., 28, 1, 52-66, (2010) · Zbl 1198.62097
[40] Karlis, D.; Santourian, A., Model-based clustering with non-elliptically contoured distributions, Stat. Comput., 19, 1, 73-83, (2009)
[41] Lagona, F.; Jdanov, D.; Shkolnikova, M., Latent time-varying factors in longitudinal analysis: a linear mixed hidden Markov model for heart rates, Stat. Med., 33, 23, 4116-4134, (2014)
[42] Lagona, F.; Maruotti, A.; Padovano, F., Multilevel multivariate modelling of legislative count data, with a hidden Markov chain, J. Roy. Statist. Soc.-Ser. A, 178, 705-723, (2015)
[43] Langrock, R.; King, R., Maximum likelihood estimation of mark-recapture-recovery models in the presence of continuous covariates, Ann. Appl. Stat., 7, 3, 1709-1732, (2013) · Zbl 1454.62451
[44] Langrock, R.; Swihart, B. J.; Caffo, B. S.; Punjabi, N. M.; Crainiceanu, C. M., Combining hidden Markov models for comparing the dynamics of multiple sleep electroencephalograms, Stat. Med., 32, 19, 3342-3356, (2013)
[45] Lee, Y.; Ghosh, D.; Hardison, R. C.; Zhang, Y., MRHMMs: multivariate regression hidden Markov models and the variants, Bioinformatics, 30, 13, 1755-1756, (2014)
[46] Lee, S. X.; McLachlan, G. J., Model-based clustering and classification with non-normal mixture distributions, Stat. Methods Appl., 22, 4, 427-454, (2013) · Zbl 1332.62209
[47] Lee, S. X.; McLachlan, G. J., Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results, Stat. Comput., 24, 2, 181-202, (2014) · Zbl 1325.62107
[48] Leroux, B. G., Maximum-likelihood estimation for hidden Markov models, Stochastic Process. Appl., 40, 1, 127-143, (1992) · Zbl 0738.62081
[49] Lin, T. I., Maximum likelihood estimation for multivariate skew normal mixture models, J. Multivariate Anal., 100, 2, 257-265, (2009) · Zbl 1152.62034
[50] Lin, T. I., Robust mixture modeling using multivariate skew \(t\) distributions, Stat. Comput., 20, 3, 343-356, (2010)
[51] Little, R. J.A., Robust estimation of the mean and covariance matrix from data with missing values, Appl. Stat., 37, 1, 23-38, (1988) · Zbl 0647.62040
[52] Lo, K.; Gottardo, R., Flexible mixture modeling via the multivariate \(t\) distribution with the Box-Cox transformation: an alternative to the skew-\(t\) distribution, Stat. Comput., 22, 1, 33-52, (2012) · Zbl 1322.62173
[53] MacDonald, I. L., Numerical maximisation of likelihood: A neglected alternative to EM?, Internat. Statist. Rev., 82, 2, 296-308, (2014)
[54] Martinez-Zarzoso, I.; Maruotti, A., The environmental Kuznets curve: functional form, time-varying heterogeneity and outliers in a panel setting, Environmetrics, 24, 7, 461-475, (2013)
[55] Maruotti, A., Mixed hidden Markov models for longitudinal data: an overview, Internat. Statist. Rev., 79, 3, 427-454, (2011) · Zbl 1238.62094
[56] Maruotti, A., Robust fitting of hidden Markov regression models under a longitudinal setting, J. Stat. Comput. Simul., 84, 8, 1728-1747, (2014)
[57] Maruotti, A.; Punzo, A.; Mastrantonio, G.; Lagona, F., A time-dependent extension of the projected normal regression model for longitudinal circular data based on a hidden Markov heterogeneity structure, Stoch. Environ. Res. Risk Assess., (2016), (in press). http://dx.doi.org/10.1007/s00477-015-1183-5
[58] Maruotti, A.; Rocci, R., A mixed non-homogeneous hidden Markov model for categorical data, with application to alcohol consumption, Stat. Med., 31, 9, 871-886, (2012)
[59] McLachlan, G. J., Discriminant analysis and statistical pattern recognition, (1992), John Wiley & Sons Hoboken, New Jersey, 2nd printing
[60] McLachlan, G. J.; Peel, D., Finite mixture models, (2000), John Wiley & Sons New York · Zbl 0963.62061
[61] Meng, X.-L.; Rubin, D. B., Maximum likelihood estimation via the ECM algorithm: A general framework, Biometrika, 80, 2, 267-278, (1993) · Zbl 0778.62022
[62] Punzo, A., Flexible mixture modeling with the polynomial Gaussian cluster-weighted model, Stat. Model., 14, 3, 257-291, (2014)
[63] Punzo, A.; Browne, R. P.; McNicholas, P. D., Hypothesis testing for mixture model selection, J. Stat. Comput. Simul., (2016), (in press). http://dx.doi.org/10.1080/00949655.2015.1131282
[64] Punzo, A.; Ingrassia, S., Clustering bivariate mixed-type data via the cluster-weighted model, Comput. Statist., (2015), (in press). http://dx.doi.org/10.1007/s00180-015-0600-z
[65] Punzo, A.; Maruotti, A., Clustering multivariate longitudinal observations: the contaminated Gaussian hidden Markov model, J. Comput. Graph. Statist., (2016), (in press). http://dx.doi.org/10.1080/10618600.2015.1089776
[66] Punzo, A., McNicholas, P.D., 2014. Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. arXiv.org e-print 1409.6019. Available at: http://arxiv.org/abs/1409.6019.
[67] Punzo, A.; McNicholas, P. D., Parsimonious mixtures of multivariate contaminated normal distributions, Biom. J., (2016), (in press) · Zbl 1353.62124
[68] Pyne, S.; Hu, X.; Wang, K.; Rossin, E.; Lin, T. I.; Maier, L. M.; Baecher-Allan, C.; McLachlan, G. J.; Tamayo, P.; Hafler, D. A.; De Jager, P. L.; Mesirov, J. P., Automated high-dimensional flow cytometric data analysis, Proc. Natl. Acad. Sci., 106, 21, 8519-8524, (2009)
[69] Raffa, J. D.; Dubin, J. A., Multivariate longitudinal data analysis with mixed effects hidden Markov models, Biometrics, 71, 3, 821-831, (2015) · Zbl 1419.62428
[70] R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/.
[71] Ritter, G., (Robust Cluster Analysis and Variable Selection, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, vol. 137, (2015), CRC Press) · Zbl 1341.62037
[72] Schliehe-Diecks, S.; Kappeler, P.; Langrock, R., On the application of mixed hidden Markov models to multiple behavioural time series, Interface Focus, 2, 180-189, (2012)
[73] Schork, N. J.; Schork, M. A., Skewness and mixtures of normal distributions, Comm. Statist. Theory Methods, 17, 11, 3951-3969, (1988) · Zbl 0696.62062
[74] Schreuder, H. T.; Hafley, W. L., A useful bivariate distribution for describing stand structure of tree heights and diameters, Biometrics, 33, 3, 471-478, (1977)
[75] Schwarz, G., Estimating the dimension of a model, Ann. Statist., 6, 2, 461-464, (1978) · Zbl 0379.62005
[76] Subedi, S.; Punzo, A.; Ingrassia, S.; McNicholas, P. D., Clustering and classification via cluster-weighted factor analyzers, Adv. Data Anal. Classif., 7, 1, 5-40, (2013) · Zbl 1271.62137
[77] Subedi, S.; Punzo, A.; Ingrassia, S.; McNicholas, P. D., Cluster-weighted \(t\)-factor analyzers for robust model-based clustering and dimension reduction, Stat. Methods Appl., 24, 4, 623-649, (2015) · Zbl 1416.62362
[78] Titterington, D. M.; Smith, A. F.M.; Makov, U. E., Statistical analysis of finite mixture distributions, (1985), John Wiley & Sons New York · Zbl 0646.62013
[79] Turner, R., Direct maximization of the likelihood of a hidden Markov model, Comput. Statist. Data Anal., 52, 9, 4147-4160, (2008) · Zbl 1452.62606
[80] Vermunt, J. K., Longitudinal research using mixture models, (Longitudinal Research with Latent Variables, (2010), Springer Berlin, Heidelberg), 119-152, (Chapter)
[81] Visser, I., Seven things to remember about hidden Markov models: A tutorial on Markovian models for time series, J. Math. Psych., 55, 6, 403-415, (2011) · Zbl 1229.62128
[82] Viterbi, A. J., Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inform. Theory, 13, 2, 260-269, (1967) · Zbl 0148.40501
[83] Wang, W.-L., Multivariate t linear mixed models for irregularly observed multiple repeated measures with missing outcomes, Biom. J., 55, 4, 554-571, (2013) · Zbl 1441.62525
[84] Wang, W.-L.; Lin, T.-I.; Lachos, V. H., Extending multivariate-\(t\) linear mixed models for multiple longitudinal data with censored responses and heavy tails, Stat. Methods Med. Res., (2015), (in press). http://dx.doi.org/10.1177/0962280215620229
[85] Zhu, X.; Melnykov, V., Manly transformation in finite mixture modeling, Comput. Statist. Data Anal., (2016) · Zbl 1469.62184
[86] Zucchini, W.; MacDonald, I. L., Hidden Markov models for time series: an introduction using R, (2009), Chapman & Hall Boca Raton, FL · Zbl 1180.62130
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. The list attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.