Exploratory data analysis of interval-valued symbolic data with matrix visualization. (English) Zbl 1506.62090

Summary: Symbolic data analysis (SDA) has gained popularity over the past few years because of its potential for handling data having a dependent and hierarchical nature. Amongst many methods for analyzing symbolic data, exploratory data analysis (EDA: [J. W. Tukey, Exploratory data analysis. Reading, Massachusetts etc.: Addison-Wesley Publishing Company (1977; Zbl 0409.62003)]) with graphical presentation is an important one. Recent developments of graphical and visualization tools for SDA include zoom star, closed shapes, and parallel-coordinate-plots. Other studies project high dimensional symbolic data into lower dimensional spaces using symbolic data versions of principal component analysis, multidimensional scaling, and self-organizing maps. Most graphical and visualization approaches for exploring symbolic data structure inherit the advantages of their counterparts for conventional (non-symbolic) data, but also their disadvantages. Here we introduce matrix visualization (MV) for visualizing and clustering symbolic data using interval-valued symbolic data as an example; it is by far the most popular symbolic data type in the literature and the most commonly encountered one in practice. Many MV techniques for visualizing and clustering conventional data are converted to symbolic data, and several techniques are newly developed for symbolic data. Various examples of data with simple to complex structures are brought in to illustrate the proposed methods.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62A09 Graphical methods in statistics

Citations:

Zbl 0409.62003

Software:

ZAME; GAP; SODAS

References:

[1] Bertin, J., Semiology of graphics: diagrams, networks, maps, (1983), The University of Wisconsin Press Madison, WI, English translation by William J. Berg
[2] Bertrand, P.; Diday, E., A visual representation of the compatibility between an order and a dissimilarity index: the pyramids, Comput. Stat. Q., 2, 1, 31-42, (1985) · Zbl 0615.62080
[3] Billard, L.; Diday, E., Regression analysis for interval-valued data, (Kiers, H. A.L.; Rasson, J.-P.; Groenen, P. J.F.; Schader, M., Data Analysis, Classification, and Related Methods, (2000), Springer-Verlag Berlin), 369-374 · Zbl 1026.62073
[4] Billard, L.; Diday, E., From the statistics of data to the statistics of knowledge: symbolic data analysis, J. Amer. Statist. Assoc., 98, 470-487, (2003)
[5] Billard, L.; Diday, E., Symbolic data analysis: conceptual statistics and data mining, 231-248, (2006), John Wiley & Sons Ltd. England · Zbl 1117.62002
[6] Billard, L., Douzal-Chouakria, A., Diday, E., 2009. Symbolic principal component for interval-valued observations, http://hal.archives-ouvertes.fr/docs/00/36/10/53/PDF/DouzalPCA.pdf.
[7] Bock, H.-H., Clustering methods and Kohonen maps for symbolic data, J. Japanese Soc. Comput. Statist., 15, 1-13, (2002)
[8] Bock, H.-H., Visualizing symbolic data by Kohonen maps, (Diday, E.; Noirhomme, M., Symbolic Data Analysis and the SODAS Software, (2008), Wiley Chichester), 205-234
[9] (Bock, H.-H.; Diday, E., Analysis of Symbolic Data, (2000), Springer-Verlag Berlin, New York)
[10] Borland, D.; Taylor, R. M., Rainbow color map (still) considered harmful, IEEE Comput. Graph. Appl., 27, 2, 14-17, (2007)
[11] Brewer, C. A., Color use guidelines for mapping and visualization, (MacEachren, A. M.; Taylor, D. R.F., Visualization in Modern Cartography, (1994), Elsevier Science Tarrytown, NY), 123-147, (Chapter 7)
[12] Brewer, C. A., Color use guidelines for data representation, (Proceedings of the Section on Statistical Graphics, (1999), American Statistical Association), 50-60
[13] Brito, P., Hierarchical and pyramidal clustering for symbolic data, J. Japanese Soc. Comput. Statist., 15, 2, 231-244, (2002) · Zbl 1330.62245
[14] Brito, P.; Duarte Silva, A. P., Modelling interval data with normal and skew-normal distributions, J. Appl. Stat., 39, 1, 3-20, (2012)
[15] Chavent, M.; de Carvalho, F. A.T.; Lechevallier, Y.; Verde, R., New clustering methods for interval data, Comput. Statist., 21, 211-230, (2006) · Zbl 1114.62069
[16] Chavent, M.; Lechevallier, Y., Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance, (Jajuga, K.; Sokolowski, A.; Bock, H.-H., Classification, Clustering, and Data Analysis, (2002), Springer-Verlag Berlin), 53-59 · Zbl 1032.62058
[17] Chen, C. H., Generalized association plots: information visualization via iteratively generated correlation matrices, Statist. Sinica, 12, 7-29, (2002) · Zbl 1027.62047
[18] Chen, C. H.; Hwu, H. G.; Jang, W. J.; Kao, C. H.; Tien, Y. J.; Tzeng, S.; Wu, H. M., Matrix visualization and information mining, (Proceedings in Computational Statistics 2004, Compstat 2004, (2004), Physica-Verlag Heidelberg), 85-100 · Zbl 1170.62308
[19] Chouakria, A.; Cazes, P.; Diday, E., Symbolic principal component analysis, (Bock, H.-H.; Diday, E., Analysis of Symbolic Data, (2000), Springer Heidelberg), 200-212 · Zbl 0977.62063
[20] de Carvalho, F. A.T.; Brito, B.; Bock, H.-H., Dynamic clustering for interval data based on \(L_2\) distance, Comput. Statist., 21, 231-250, (2006) · Zbl 1114.62070
[21] de Falguerolles, A.; Friedrich, F.; Sawitzki, G., A tribute to J. Bertin's graphical data analysis, (Bandilla, W.; Faulbaum, F., SoftStat 97, Advances in Statistical Software 6, (1997), Lucius & Lucius), 11-20
[22] Denœux, T.; Masson, M., Multidimensional scaling of interval-valued dissimilarity data, Pattern Recognit. Lett., 21, 83-92, (2000)
[23] Diday, E., La méthode des nuées dynamiques, Rev. Statist. Appl., 19, 2, 19-34, (1971)
[24] Diday, E., The symbolic approach in clustering and related methods of data analysis, (Bock, H.-H., Classification and Related Methods of Data Analysis, (1987), North-Holland Amsterdam), 673-684
[25] Diday, E., An introduction to symbolic data analysis and the SODAS software, J. Symb. Data Anal., 1, (2002)
[26] (Diday, E.; Noirhomme-Fraiture, M., Symbolic Data Analysis and The SODAS Software, (2008), John Wiley & Sons Ltd. Chichester, England) · Zbl 1275.62029
[27] Duarte Silva, A. P.; Brito, P., Linear discriminant analysis for interval data, Comput. Statist., 21, 2, 289-308, (2006) · Zbl 1113.62080
[28] Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D., Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, 95, 14863-14868, (1998)
[29] El Golli, A.; Conan-Guez, B.; Rossi, F., A self-organizing map for dissimilarity data, (Banks, D.; House, L.; McMorris, F. R.; Arabie, P.; Gaul, W., Classification, Clustering, and Data Mining Applications, Studies in Classification, Data Analysis, and Knowledge Organization, (2004), Springer Heidelberg), 61-68
[30] Elmqvist, N., Do, T.-N., Goodell, H., Henry, N., Fekete, J.-D., 2008. ZAME: interactive large-scale graph visualization. In Proceedings of the IEEE Pacific Visualization Symposium, pp. 215-222.
[31] Elmqvist, N.; Dragicevic, P.; Fekete, J.-D., Color Lens: adaptive color scale optimization for visual exploration, IEEE Trans. Vis. Comput. Graphics, 17, 6, 795-807, (2011)
[32] Friendly, M., Corrgrams: exploratory displays for correlation matrices, Amer. Statist., 56, 4, 316-324, (2002)
[33] Ghoniem, M.; Fekete, J.; Castagliola, P., On the readability of graphs using node-link and matrix-based representations: a controlled experiment and statistical analysis, Inf. Vis., 4, 2, 114-135, (2005)
[34] Gioia, F.; Lauro, N. C., Principal component analysis on interval data, Comput. Statist., 21, 2, 343-363, (2006) · Zbl 1113.62072
[35] Gowda, K. C.; Diday, E., Symbolic clustering using a new dissimilarity measure, Pattern Recognit., 24, 567-578, (1991)
[36] Groenen, P. J.F.; Winsberg, S.; Rodriguez, O.; Diday, E., I-scal: multidimensional scaling of interval dissimilarities, Comput. Statist. Data Anal., 51, 360-378, (2006) · Zbl 1157.62450
[37] Guo, J.; Li, W.; Li, C.; Gao, S., Standardization of interval symbolic data based on the empirical descriptive statistics, Comput. Statist. Data Anal., 56, 3, 602-610, (2012) · Zbl 1239.62003
[38] Hamada, A., Minami, H., Mizuta, M., 2008. Principal component analysis for modal interval-valued data. In: Proceedings of IASC2008, the Joint Meeting of 4th World Conference of the IASC and 6th Conference of the Asian Regional Section of the IASC on Computational Statistics & Data Analysis. ISBN 978-4-9904445-1-8, pp. 512-519.
[39] Henry, N.; Fekete, J. D., MatrixExplorer: a dual-representation system to explore social networks, IEEE Trans. Vis. Comput. Graphics, 12, 5, 677-684, (2006)
[40] Hubert, L.; Arabie, P., Comparing partitions, J. Classification, 2, 193-218, (1985)
[41] Ichino, M., General metrics for mixed features-the Cartesian space theory for pattern recognition, (Proceedings of the 1988 Conference on Systems, Man, and Cybernetics, (1988), Pergamon Oxford), 494-497
[42] Irpino, A.; Verde, R.; Lauro, N. C., Visualizing symbolic data by closed shapes, (Schader, M.; Gaul, W.; Vichi, M., Between Data Science and Applied Data Analysis, (2003), Springer-Verlag Berlin), 244-251 · Zbl 05280179
[43] Lauro, N. C.; Palumbo, F., New graphical symbolic objects representations in parallel coordinates, (Schader, M.; Gaul, W.; Vichi, M., Between Data Science and Applied Data Analysis, (2003), Springer-Verlag Berlin), 288-295 · Zbl 05280184
[44] Lauro, N. C.; Verde, R.; Palumbo, F., Factorial discriminant analysis on symbolic objects, (Bock, H.-H.; Diday, E., Analysis of Symbolic Data, (2000), Springer-Verlag Berlin), 212-233 · Zbl 0977.62070
[45] Liiv, I., Seriation and matrix reordering methods: an historical overview, Stat. Anal. Data Min., (2010)
[46] Liiv, I.; Opik, R.; Ubi, J.; Stasko, J., Visual matrix explorer for collaborative seriation, Wiley Interdiscip. Rev. Comput. Stat., 4, 1, 85-97, (2012)
[47] Lima Neto, E. A.; De Carvalho, F. A.T., Centre and range method for fitting a linear regression model to symbolic interval data, Comput. Statist. Data Anal., 52, 1500-1515, (2008) · Zbl 1452.62493
[48] Lima Neto, E. A.; De Carvalho, F. A.T., Constrained linear regression models for symbolic interval-valued variables, Comput. Statist. Data Anal., 54, 333-347, (2010) · Zbl 1464.62055
[49] Marchette, D. J.; Solka, J. L., Using data images for outlier detection, Comput. Statist. Data Anal., 43, 541-552, (2003) · Zbl 1429.62040
[50] Micallef, L.; Dragicevic, P.; Fekete, J-D., Assessing the effect of visualizations on Bayesian reasoning through crowdsourcing, IEEE Trans. Vis. Comput. Graphics, 18, 12, 2536-2545, (2012)
[51] Minami, H., Mizuta, M., 2008. Symbolic multidimensional scaling and its application for internet traffic data COMPSTAT2008. In: Porto COMPSTAT’2008 Book of Abstracts, Faculdade de Economia da Universidade do Porto, FEP, 171.
[52] Minnotte, M., West, W., 1998. The data image: a tool for exploring high dimensional data sets. In: Proceedings of the ASA Section on Statistical Graphics, Dallas, Texas, pp. 25-33.
[53] Noirhomme-Fraiture, M.; Rouard, M., Visualizing and editing symbolic objects, (Bock, H.-H.; Diday, E., Analysis of Symbolic Data, (2000), Springer-Verlag Berlin), 125-138 · Zbl 0978.62004
[54] Palumbo, F.; Lauro, N. C., A PCA for interval valued data based on midpoints and radii, (Yanai, H.; Okada, A.; Shigemasu, K.; Kano, Y.; Meulman, J. J., New Developments in Psychometrics, (2003), Springer-Verlag Tokyo), 641-648
[55] Peng, R. D., A method for visualizing multivariate time series data, J. Stat. Softw., 25, 1, 1-17, (2008)
[56] Rosenberg, N. A., DISTRUCT: a program for the graphical display of population structure, Mol. Ecol. Notes, 4, 1, 137-138, (2003)
[57] Saito, T., Miyamura, H.N., Yamamoto, M., Saito, H., Hoshiya, Y., Kaseda, T., 2005. Two-tone pseudo coloring: compact visualization for one-dimensional data. In: Proceedings of the IEEE Symposium on Information Visualization, pp. 173-180.
[58] Sokal, R. R.; Rohlf, F. J., The comparison of dendrograms by objective methods, Taxon, 11, 33-40, (1962)
[59] Souza, R. M.C. R.; de Carvalho, F. A.T., Clustering of interval data based on city-block distances, Pattern Recognit. Lett., 25, 3, 353-365, (2004)
[60] Tien, Y. J.; Lee, Y. S.; Wu, H. M.; Chen, C. H., Methods for simultaneously identifying coherent local clusters with smooth global patterns in gene expression profiles, BMC Bioinformatics, 9, 155, (2008)
[61] Tukey, J. W., Exploratory data analysis, (1977), Addison-Wesley · Zbl 0409.62003
[62] Verde, R.; Lechevallier, Y., Crossed clustering method on symbolic data tables, (New Developments in Classification and Data Analysis, (2005), Springer), 87-94 · Zbl 1341.62193
[63] Verde, R.; Lechevallier, Y.; Chavent, M., Symbolic clustering interpretation and visualization, J. Symb. Data Anal., 1, 1, (2003)
[64] Ware, C., Information visualization: perception for design, 103-149, (2004), Morgan Kaufmann
[65] Wegman, E. J., Hyperdimensional data analysis using parallel coordinates, J. Amer. Statist. Assoc., 85, 411, 664-675, (1990)
[66] Weinstein, J. N., A postgenomic visual icon, Science, 319, 1772-1773, (2008)
[67] Wijffelaars, M.; Vliegen, R.; van Wijk, J., Generating color palettes using intuitive parameters, Comput. Graph. Forum, 27, 3, 743-750, (2008)
[68] Wilkinson, L.; Friendly, M., The history of the cluster heat map, Amer. Statist., 63, 2, 179-184, (2009)
[69] Wu, H. M.; Tien, Y. J.; Chen, C. H., GAP: a graphical environment for matrix visualization and cluster analysis, Comput. Statist. Data Anal., 54, 767-778, (2010) · Zbl 1464.62013
[70] Wu, H. M.; Tzeng, S.; Chen, C. H., Matrix visualization, (Chen, C.-H.; Härdle, W.; Unwin, A., Handbook of Computational Statistics (Volume III): Data Visualization, (2008), Springer-Verlag Heidelberg)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.