Partition of interval-valued observations using regression. (English) Zbl 07512353

Summary: Both regression modeling and clustering methodologies have been extensively studied as separate techniques. There has been some activity in using regression-based algorithms to partition a data set into clusters for classical data; we propose one such algorithm to cluster interval-valued data. The new algorithm is based on the \(k\)-means algorithm of J. MacQueen [in: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability. Vol. 1. Berkeley, CA: University of California Press. 281–297 (1967; Zbl 0214.46201)] and the dynamical partitioning method of E. Diday and J. C. Simon [in: Digital pattern recognition. Berlin, Heidelberg, New York: Springer. 47–94 (1976; Zbl 0331.62043)], with the partitioning criteria based on establishing regression models for each sub-cluster. The criteria also depend on distance measures between the underlying regression models for each sub-cluster. Several types of simulated data sets are generated for several different data structures. The proposed \(k\)-regressions algorithm consistently outperforms the \(k\)-means algorithm. Elbow plots are used to identify the total number of clusters \(K\) in the partition. The new method is also applied to real data.
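
The summary does not spell out the partitioning criterion or the distance measures, so the following is only a minimal sketch of the general idea of regression-based partitioning, not the authors' algorithm. It assumes each interval is reduced to its midpoint (a simplification in the spirit of the center method), fits an ordinary least-squares line per cluster, and reassigns each observation to the cluster whose line gives the smallest squared residual. The names `interval_centers` and `k_regressions` are hypothetical and chosen for illustration.

```python
import numpy as np

def interval_centers(lower, upper):
    """Assumption: represent each interval [lower, upper] by its midpoint."""
    return (np.asarray(lower, float) + np.asarray(upper, float)) / 2.0

def k_regressions(X, y, k, n_iter=50, seed=0):
    """Alternate per-cluster least-squares fits with reassignment of each
    observation to the cluster whose fitted model predicts it best.
    Illustrative sketch only; n_iter must be at least 1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    labels = rng.integers(0, k, size=n)         # random initial partition
    coefs = np.zeros((k, A.shape[1]))
    for _ in range(n_iter):
        for j in range(k):
            members = labels == j
            if members.sum() >= A.shape[1]:     # refit only with enough points
                coefs[j] = np.linalg.lstsq(A[members], y[members], rcond=None)[0]
        resid2 = (y[:, None] - A @ coefs.T) ** 2    # (n, k) squared residuals
        new_labels = resid2.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # partition is stable
            break
        labels = new_labels
    sse = resid2[np.arange(n), labels].sum()        # within-cluster residual SS
    return labels, coefs, sse

# Simulated interval data lying around two different regression lines.
rng = np.random.default_rng(1)
x_mid = rng.uniform(0.0, 10.0, 80)
group = np.arange(80) % 2
y_mid = np.where(group == 0, 2 * x_mid + 1, -2 * x_mid + 30) + rng.normal(0, 0.3, 80)
x_c = interval_centers(x_mid - 0.5, x_mid + 0.5)

# Residual sum of squares for K = 1..4; plotting these against K gives an
# elbow plot of the kind the summary mentions.
for K in range(1, 5):
    print(K, k_regressions(x_c[:, None], y_mid, K)[2])
```

Because the result depends on the random initial partition, a practical run would typically repeat the procedure from several seeds and keep the partition with the smallest residual sum of squares.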

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Anderberg, MR, Cluster analysis for applications (1973), New York: Academic Press, New York · Zbl 0299.62029
[2] Batagelj, V., Kejžar, N., & Korenjak-Černe, S. (2015). Clustering of modal valued symbolic data. Machine Learning. arXiv:1507.06683. · Zbl 07363883
[3] Bertrand, P., & Goupil, F. (2000). Descriptive statistics for symbolic data. In H.-H. Bock & E. Diday (Eds.) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data (pp. 103-124). Berlin: Springer. · Zbl 0978.62005
[4] Billard, L., Brief overview of symbolic data and analytic issues, Statistical Analysis and Data Mining, 4, 149-156 (2011) · Zbl 07260274 · doi:10.1002/sam.10115
[5] Billard, L. (2014). The past’s present is now. What will the present’s future bring? In X. Lin, C. Genest, D.L. Banks, G. Molenberghs, D.W. Scott, & J.-L. Wang (Eds.) Past, present, and future of statistical science (pp. 323-334). New York: Chapman and Hall.
[6] Billard, L., & Diday, E. (2000). Regression analysis for interval-valued data. In H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, & M. Schader (Eds.) Data analysis, classification, and related methods (pp. 369-374). Springer. · Zbl 1026.62073
[7] Billard, L.; Diday, E., From the statistics of data to the statistics of knowledge: Symbolic data analysis, Journal of the American Statistical Association, 98, 470-487 (2003) · doi:10.1198/016214503000242
[8] Billard, L.; Diday, E., Symbolic data analysis: conceptual statistics and data mining (2006), Chichester: Wiley, Chichester · Zbl 1117.62002 · doi:10.1002/9780470090183
[9] Bock, H.-H. (2007). Clustering methods: A history of k-means algorithms. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 161-172). Berlin: Springer. · Zbl 1181.68229
[10] Bock, H-H, Origins and extensions of the k-means algorithm in cluster analysis, Journal Électronique d’Histoire des Probabilités et de la Statistique, 4, 1-18 (2008) · Zbl 1175.01030
[11] Bock, H-H; Diday, E., Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data (2000), Berlin: Springer, Berlin · Zbl 1039.62501 · doi:10.1007/978-3-642-57155-8
[12] Bougeard, S.; Abdi, H.; Saporta, G.; Niang, N., Clusterwise analysis for multiblock component methods, Advances in Data Analysis and Classification, 12, 285-313 (2018) · Zbl 1414.62231 · doi:10.1007/s11634-017-0296-8
[13] Bougeard, S.; Cariou, V.; Saporta, G.; Niang, N., Prediction for regularized clusterwise multiblock regression, Applied Stochastic Models in Business and Industry, 34, 852-867 (2017) · Zbl 1414.62230 · doi:10.1002/asmb.2335
[14] Brusco, MJ; Cradit, JD; Steinley, D.; Fox, GL, Cautionary remarks on the use of clusterwise regression, Multivariate Behavioral Research, 43, 29-49 (2008) · doi:10.1080/00273170701836653
[15] Charles, C. (1977). Régression typologique et reconnaissance des formes. Thèse de 3ème cycle, Université de Paris-Dauphine.
[16] Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance. In K. Jajuga, A. Sokolowski, & H.-H. Bock (Eds.) Classification, clustering, and data analysis (pp. 53-60). Berlin: Springer. · Zbl 1032.62058
[17] Cormack, RM, A review of classification, Journal of the Royal Statistical Society A, 134, 321-367 (1971) · doi:10.2307/2344237
[18] de Carvalho, F.A.T., Lima Neto, E.A., & Tenorio, C.P. (2004a). A new method to fit a linear regression model for interval-valued data. In Lecture notes in computer science, KI2004 advances in artificial intelligence (pp. 295-306). Springer. · Zbl 1132.68617
[19] de Carvalho, F.A.T., de Souza, R.M.C.R., & Silva, F.C.D. (2004b). A clustering method for symbolic interval-type data using adaptive Chebyshev distances. In A.L.C. Bazzan & S. Labidi (Eds.) LNAI 3171 (pp. 266-275). Berlin: Springer. · Zbl 1105.68410
[20] de Carvalho, FAT; Brito, MP; Bock, H-H, Dynamic clustering for interval data based on L2 distance, Computational Statistics, 21, 231-250 (2006) · Zbl 1114.62070 · doi:10.1007/s00180-006-0261-z
[21] de Carvalho, FAT; Lechevallier, Y., Partitional clustering algorithms for symbolic interval data based on single adaptive distances, Pattern Recognition, 42, 1223-1236 (2009) · Zbl 1183.68527 · doi:10.1016/j.patcog.2008.11.016
[22] de Carvalho, F.A.T., Saporta, G., & Queiroz, D.N. (2010). A clusterwise center and range regression model for interval-valued data. In Y. Lechevallier & G. Saporta (Eds.) Proceedings in computational statistics COMPSTAT 2010 (pp. 461-468). Berlin: Springer. · Zbl 1436.62371
[23] DeSarbo, WS; Cron, WL, A maximum likelihood methodology for clusterwise linear regression, Journal of Classification, 5, 249-282 (1988) · Zbl 0692.62052 · doi:10.1007/BF01897167
[24] de Souza, RMCR; de Carvalho, FAT, Clustering of interval data based on city-block distances, Pattern Recognition Letters, 25, 353-365 (2004) · doi:10.1016/j.patrec.2003.10.016
[25] de Souza, R.M.C.R., de Carvalho, F.A.T., Tenório, C.P., & Lechevallier, Y. (2004). Dynamic cluster methods for interval data based on Mahalanobis distances. In D. Banks, L. House, F. R. McMorris, P. Arabie, & W. Gaul (Eds.) Classification, clustering, and data analysis (pp. 251-360). Berlin: Springer.
[26] Diday, E., Une nouvelle méthode de classification automatique et reconnaissance des formes: la méthode des nuées dynamiques, Revue de Statistique Appliquée, 2, 19-33 (1971)
[27] Diday, E., La méthode des nuées dynamiques, Revue de Statistique Appliquée, 19, 19-34 (1971)
[28] Diday, E. (1987). Introduction à l’approche symbolique en analyse des données. Premières Journées Symbolique-Numérique, CEREMADE, Université Paris-Dauphine, 21-56.
[29] Diday, E., Thinking by classes in data science: The symbolic data analysis paradigm, WIREs Computational Statistics, 8, 172-205 (2016) · doi:10.1002/wics.1384
[30] Diday, E.; Noirhomme-Fraiture, M., Symbolic data analysis and the SODAS software (2008), Chichester: Wiley, Chichester · Zbl 1275.62029
[31] Diday, E., & Simon, J.C. (1976). Clustering analysis. In K.S. Fu (Ed.) Digital pattern recognition (pp. 47-94). Berlin: Springer. · Zbl 0331.62043
[32] Draper, NR; Smith, H., Applied regression analysis (1966), New York: Wiley, New York · Zbl 0158.17101
[33] Hausdorff, F., Set theory (translated into English by J. R. Aumann 1957) (1937), New York: Chelsea, New York
[34] Irpino, A., Verde, R., & Lechevallier, Y. (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi & M. Vichi (Eds.) COMPSTAT 2006 (pp. 869-876). Berlin: Physica-Verlag.
[35] Jain, AK, Data clustering: 50 years beyond K-means, Pattern Recognition Letters, 31, 651-666 (2010) · doi:10.1016/j.patrec.2009.09.011
[36] Jain, AK; Murty, MN; Flynn, PJ, Data clustering: A review, ACM Computing Surveys, 31, 263-323 (1999) · doi:10.1145/331499.331504
[37] Johnson, RA; Wichern, DW, Applied multivariate statistical analysis (2007), New Jersey: Prentice-Hall, New Jersey · Zbl 1269.62044
[38] Korenjak-Černe, S.; Batagelj, V.; Pavešić, BJ, Clustering large data sets described with discrete distributions and its application on TIMSS data set, Statistical Analysis and Data Mining, 4, 199-215 (2011) · Zbl 07260278 · doi:10.1002/sam.10105
[39] Košmelj, K.; Billard, L., Mallows’ L2 distance in some multivariate methods and its application to histogram-type data, Metodološki Zvezki, 9, 107-118 (2012)
[40] Leroy, B., Chouakria, A., Herlin, I., & Diday, E. (1996). Approche géométrique et classification pour la reconnaissance de visage. Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, 548-557.
[41] Lima Neto, EA; de Carvalho, FAT, Centre and range method for fitting a linear regression model to symbolic interval data, Computational Statistics and Data Analysis, 52, 1500-1515 (2008) · Zbl 1452.62493 · doi:10.1016/j.csda.2007.04.014
[42] Lima Neto, E.A., de Carvalho, F.A.T., & Freire, E.S. (2005). Applying constrained linear regression models to predict interval-valued data. In U. Furbach (Ed.) Lecture notes in computer science, KI: advances in artificial intelligence (pp. 92-106). Berlin: Springer. · Zbl 1137.62357
[43] Lima Neto, E.A., de Carvalho, F.A.T., & Tenorio, C.P. (2004). Univariate and multivariate linear regression methods to predict interval-valued features. In Lecture notes in computer science, AI 2004, advances in artificial intelligence (pp. 526-537). Berlin: Springer.
[44] Liu, F. (2016). Cluster analysis for symbolic interval data using linear regression method. Doctoral Dissertation, University of Georgia.
[45] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L.M. LeCam & J. Neyman (Eds.) Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, (Vol. 1 pp. 281-297). Berkeley: University of California Press. · Zbl 0214.46201
[46] Noirhomme-Fraiture, M.; Brito, MP, Far beyond the classical data models: Symbolic data analysis, Statistical Analysis and Data Mining, 4, 157-170 (2011) · Zbl 07260275 · doi:10.1002/sam.10112
[47] Qian, G.; Wu, Y., Estimation and selection in regression clustering, European Journal of Pure and Applied Mathematics, 4, 455-466 (2011) · Zbl 1389.62102
[48] Rao, C.R., Wu, Y., & Shao, Q. (2007). An M-estimation-based procedure for determining the number of regression models in regression clustering. Journal of Applied Mathematics and Decision Sciences, Article ID 37475. · Zbl 05304406
[49] Shao, Q.; Wu, Y., A consistent procedure for determining the number of clusters in regression clustering, Journal of Statistical Planning and Inference, 135, 461-476 (2005) · Zbl 1074.62042 · doi:10.1016/j.jspi.2004.04.021
[50] Späth, H., Algorithm 39 clusterwise linear regression, Computing, 22, 367-373 (1979) · Zbl 0387.65028 · doi:10.1007/BF02265317
[51] Späth, H., Correction to algorithm 39: clusterwise linear regression, Computing, 26, 275 (1981) · Zbl 0444.65020 · doi:10.1007/BF02243486
[52] Späth, H., A fast algorithm for clusterwise linear regression, Computing, 29, 175-181 (1982) · Zbl 0485.65030 · doi:10.1007/BF02249940
[53] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society B, 63, 411-423 (2001) · Zbl 0979.62046 · doi:10.1111/1467-9868.00293
[54] Verde, R., & Irpino, A. (2007). Dynamic clustering of histogram data: Using the right metric. In P. Brito, P. Bertrand, G. Cucumel, & F. de Carvalho (Eds.) Selected contributions in data analysis and classification (pp. 123-134). Berlin: Springer. · Zbl 1151.62335
[55] Wedel, M.; Kistemaker, C., Consumer benefit segmentation using clusterwise linear regression, International Journal of Research in Marketing, 6, 45-59 (1989) · doi:10.1016/0167-8116(89)90046-3
[56] Xu, W. (2010). Symbolic data analysis: interval-valued data regression. Doctoral Dissertation, University of Georgia.
[57] Zhang, B. (2003). Regression clustering. In X. Wu, A. Tuzhilin, & J. Shavlik (Eds.) Proceedings third IEEE international conference on data mining (pp. 451-458). California: IEEE Computer Society Publishers.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.