
Multivariate analysis by data depth: Descriptive statistics, graphics and inference. (With discussions and rejoinder). (English) Zbl 0984.62037

Summary: A data depth can be used to measure the “depth” or “outlyingness” of a given multivariate sample with respect to its underlying distribution. This leads to a natural center-outward ordering of the sample points. Based on this ordering, quantitative and graphical methods are introduced for analyzing multivariate distributional characteristics such as location, scale, bias, skewness and kurtosis, as well as for comparing inference methods. All graphs are one-dimensional curves in the plane and can be easily visualized and interpreted.
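The center-outward ordering described above can be sketched in a few lines. The example below assumes the Mahalanobis depth, $D(x) = 1/(1 + (x-\bar{x})^{\top} S^{-1} (x-\bar{x}))$, with the sample mean and sample covariance as plug-in estimates; this is only one of several depth notions, and the function names `mahalanobis_depth` and `center_outward_order` are illustrative rather than taken from the paper or from any library.

```python
import numpy as np


def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`.

    Assumes at least two dimensions and a nonsingular sample covariance.
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    sample = np.asarray(sample, dtype=float)
    diff = points - sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    # Squared Mahalanobis distance (x - mean)' S^{-1} (x - mean), row by row.
    d2 = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return 1.0 / (1.0 + d2)


def center_outward_order(sample):
    """Indices of the sample points sorted from deepest to most outlying."""
    depths = mahalanobis_depth(sample, sample)
    return np.argsort(-depths, kind="stable")


# Hypothetical toy data: the first point sits exactly at the sample mean.
sample = np.array(
    [[0, 0], [1, 0], [-1, 0], [0, 2], [0, -2], [3, 3], [-3, -3]], dtype=float
)
order = center_outward_order(sample)
depths = mahalanobis_depth(sample, sample)
```

Because the squared Mahalanobis distance is unchanged under nonsingular affine maps, the depths (and hence the ordering) in this sketch are affine invariant, in line with the invariance property mentioned in the summary.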
A “sunburst plot” is presented as a bivariate generalization of the box-plot. DD-plots (depth versus depth plots) are proposed and examined as graphical inference tools. Some new diagnostic tools for checking multivariate normality are introduced. One of them monitors the exact rate of growth of the maximum deviation from the mean, while the others examine the ratio of the overall dispersion to the dispersion of a certain central region. The affine invariance property of a data depth also leads to appropriate invariance properties for the proposed statistics and methods.
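As a rough illustration of the depth versus depth idea, the sketch below computes DD-plot coordinates for two samples: each pooled point is assigned its depth with respect to the first sample and its depth with respect to the second. The Mahalanobis depth is again an assumption made for simplicity, and the names `dd_plot_coordinates` and `mean_gap` are hypothetical helpers, not the authors' code. When both samples come from the same distribution the points should cluster near the diagonal; a location shift pulls them away from it.

```python
import numpy as np


def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    sample = np.asarray(sample, dtype=float)
    diff = points - sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    d2 = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return 1.0 / (1.0 + d2)


def dd_plot_coordinates(x, y):
    """Return an (n + m, 2) array: depth w.r.t. x and depth w.r.t. y
    for every point of the pooled sample."""
    pooled = np.vstack([x, y])
    return np.column_stack(
        [mahalanobis_depth(pooled, x), mahalanobis_depth(pooled, y)]
    )


def mean_gap(coords):
    """Average absolute distance of the DD-plot points from the diagonal."""
    return float(np.mean(np.abs(coords[:, 0] - coords[:, 1])))


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))
y_shift = y_same + np.array([2.0, 0.0])

gap_same = mean_gap(dd_plot_coordinates(x, y_same))
gap_shift = mean_gap(dd_plot_coordinates(x, y_shift))
```

Plotting the two columns of `dd_plot_coordinates` against each other (for example with a scatter plot) gives the DD-plot itself; `mean_gap` is just a crude numerical summary of how far the cloud departs from the diagonal.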

MSC:

62H05 Characterization and structure theory for multivariate probability distributions; copulas
62A09 Graphical methods in statistics
62-07 Data analysis (statistics) (MSC2010)
62J20 Diagnostics, and linear inference and regression

Software:

AS 307

References:

[1] ANDERSON, T. 1984. An Introduction to Multivariate Statistical Analysis. Wiley, New York. · Zbl 0651.62041
[2] ANDREWS, D. 1972. Plots of high-dimensional data. Biometrics 28 125-136.
[3] ARCONES, M., CHEN, Z. and GINÉ, E. 1994. Estimators related to U-processes with applications to multivariate medians: asymptotic normality. Ann. Statist. 22 1460-1477. · Zbl 0827.62023
[4] AVEROUS, J. and MESTE, M. 1997. Skewness for multivariate distributions: two approaches. Ann. Statist. 25 1984-1997. · Zbl 0882.62045
[5] BARNETT, V. 1976. The ordering of multivariate data. J. Roy. Statist. Soc. Ser. A 139 319-354.
[6] BERAN, R. 1979. Testing for ellipsoidal symmetry of a multivariate density. Ann. Statist. 7 150-162. · Zbl 0406.62029
[7] BERAN, R. and MILLAR, P. 1997. Multivariate symmetry models. In Festschrift for Lucien Le Cam (L. Le Cam, E. Torgersen and G. Yang, eds.) 13-42. Springer, New York. · Zbl 0948.62039
[8] BICKEL, P. and LEHMANN, E. 1975a. Descriptive statistics for nonparametric models I. Introduction. Ann. Statist. 3 1038-1044. · Zbl 0321.62054
[9] BICKEL, P. and LEHMANN, E. 1975b. Descriptive statistics for nonparametric models II. Location. Ann. Statist. 3 1045-1069. · Zbl 0321.62055
[10] BICKEL, P. and LEHMANN, E. 1976. Descriptive statistics for nonparametric models III. Dispersion. Ann. Statist. 4 1139-1158. · Zbl 0351.62031
[11] BICKEL, P. and LEHMANN, E. 1979. Descriptive statistics for nonparametric models IV. Spread. In Contributions to Statistics, Hájek Memorial Volume (J. Jurečková, ed.) 33-40. Reidel, London. · Zbl 0415.62015
[12] BROWN, B. and HETTMANSPERGER, T. 1989. The affine invariant bivariate version of the sign test. J. Roy. Statist. Soc. Ser. B 51 117-125. · Zbl 0675.62036
[13] CHAUDHURI, P. 1996. On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. 90 862-872. · Zbl 0869.62040
[14] CHENG, A., LIU, R. and LUXHOJ, J. 1999. Monitoring multivariate aviation safety data: control charts and threshold systems. IIE Transactions. To appear.
[15] CHERNOFF, H. 1973. The use of faces to represent points in k-dimensional space graphically. J. Amer. Statist. Assoc. 68 361-368.
[16] DONOHO, D. and GASKO, M. 1992. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist. 20 1803-1827. · Zbl 0776.62031
[17] DÜMBGEN, L. 1992. Limit theorems for simplicial depth. Statist. Probab. Lett. 14 119-128. · Zbl 0758.60030
[18] EASTON, G. and MCCULLOCH, R. 1990. A multivariate generalization of quantile-quantile plots. J. Amer. Statist. Assoc. 85 376-386.
[19] EDDY, W. 1982. Convex hull peeling. In COMPSTAT (H. Caussinus et al., eds.) 42-47. Physica, Vienna. · Zbl 0493.62020
[20] EINMAHL, J. and MASON, D. 1992. Generalized quantile process. Ann. Statist. 20 1062-1078. · Zbl 0757.60012
[21] FRAIMAN, R., LIU, R. and MELOCHE, J. 1997. Multivariate density estimation by probing depth. In L1-Statistical Procedures and Related Topics 415-430. IMS, Hayward, CA. · Zbl 0919.62050
[22] FRAIMAN, R. and MELOCHE, J. 1996. Multivariate L-estimation. Preprint. · Zbl 0942.62062
[23] FRIEDMAN, J. and RAFSKY, L. 1979. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7 697-717. · Zbl 0423.62034
[24] FRIEDMAN, J. and RAFSKY, L. 1981. Graphics for the multivariate two-sample problem (with comments). J. Amer. Statist. Assoc. 76 277-295.
[25] GASTWIRTH, J. 1971. A general definition of the Lorenz curve. Econometrica 39 1037-1039. · Zbl 0245.62082
[26] GNANADESIKAN, R. 1997. Methods for Statistical Data Analysis of Multivariate Observations, 2nd ed. Wiley, New York. · Zbl 0403.62034
[27] HE, X. and WANG, G. 1997. Convergence of depth contours for multivariate datasets. Ann. Statist. 25 495-504. · Zbl 0873.62053
[28] HETTMANSPERGER, T. 1984. Statistical Inference Based on Ranks. Wiley, New York. · Zbl 0592.62031
[29] HETTMANSPERGER, T., NYBLOM, J. and OJA, H. 1992. On multivariate notions of sign and rank. In L1-Statistical and Related Methods (Y. Dodge, ed.) 267-278. North-Holland, Amsterdam. · Zbl 0763.62026
[30] HETTMANSPERGER, T. and OJA, H. 1994. Affine invariant multivariate multisample sign tests. J. Roy. Statist. Soc. Ser. B 56 235-249. · Zbl 0795.62056
[31] HODGES, J. 1955. A bivariate sign test. Ann. Math. Statist. 26 523-527. · Zbl 0065.12401
[32] HUBER, P. 1972. Robust statistics: a review. Ann. Math. Statist. 43 1041-1067. · Zbl 0254.62023
[33] HÜSLER, J., LIU, R. and SINGH, K. 1999. A formula for the tail probability of a multivariate normal distribution and its applications.
[34] KENDALL, K., STUART, A. and ORD, J. K. 1987. Kendall’s Advanced Theory of Statistics 1. Oxford Univ. Press. · Zbl 0621.62001
[35] KLEINER, B. and HARTIGAN, J. 1981. Representing points in many dimensions by trees and castles (with comments). J. Amer. Statist. Assoc. 76 260-276. · Zbl 0468.62053
[36] KOLTCHINSKII, V. 1997. M-estimator, convexity and quantiles. Ann. Statist. 25 435-477. · Zbl 0878.62037
[37] LEHMANN, E. 1991. Theory of Point Estimation. Wadsworth and Brooks/Cole, Belmont, CA. · Zbl 0801.62025
[38] LIU, R. 1990. On a notion of data depth based on random simplices. Ann. Statist. 18 405-414. · Zbl 0701.62063
[39] LIU, R. 1992. Data depth and multivariate rank tests. In L1-Statistics and Related Methods (Y. Dodge, ed.) 279-294. North-Holland, Amsterdam. · Zbl 0772.62031
[40] LIU, R. 1995. Control charts for multivariate processes. J. Amer. Statist. Assoc. 90 1380-1388. · Zbl 0868.62075
[41] LIU, R. and SINGH, K. 1993. A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc. 88 257-260. · Zbl 0772.62031
[42] LIU, R. and SINGH, K. 1997. Notions of limiting P-values based on data depth and bootstrap. J. Amer. Statist. Assoc. 91 266-277. · Zbl 0889.62010
[43] LORENZ, M. 1905. Methods of measuring the concentration of wealth. J. Amer. Statist. Assoc. 9 209-219.
[44] MAHALANOBIS, P. C. 1936. On the generalized distance in statistics. Proc. Nat. Acad. Sci. India 12 49-55. · Zbl 0015.03302
[45] MARDEN, J. 1998. Bivariate qq-plot. Statist. Sinica 8 813-826. · Zbl 0915.62057
[46] MARDIA, K., KENT, J. and BIBBY, J. 1979. Multivariate Analysis. Academic Press, New York. · Zbl 0432.62029
[47] MUIRHEAD, R. 1982. Aspects of Multivariate Statistical Theory. Wiley, New York. · Zbl 0556.62028
[48] NOLAN, D. 1992. Asymptotics for multivariate trimming. Stochastic Process. Appl. 42 157-169. · Zbl 0763.62007
[49] OJA, H. 1983. Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1 327-332. · Zbl 0517.62051
[50] PARELIUS, J. 1997. Multivariate analysis based on data depth. Ph.D. dissertation. Dept. Statistics, Rutgers Univ., New Jersey.
[51] ROUSSEEUW, P. and HUBERT, M. 1999. Regression depth (with discussion). J. Amer. Statist. Assoc. 94 388-433. · Zbl 1070.62509
[52] ROUSSEEUW, P. J. and LEROY, A. M. 1987. Robust Regression and Outlier Detection. Wiley, New York. · Zbl 0711.62030
[53] ROUSSEEUW, P. and RUTS, I. 1996. AS 307: bivariate location depth. Appl. Statist. 45 516-526. · Zbl 0905.62002
[54] ROUSSEEUW, P. and RUTS, I. 1997. The bagplot: a bivariate box-and-whiskers plot. Preprint.
[55] ROUSSEEUW, P. and STRUYF, A. 1998. Computing location depth and regression depth in higher dimensions. Statist. Comput. 8 193-203.
[56] RUTS, I. and ROUSSEEUW, P. 1996. Computing depth contours of bivariate point clouds. Computational Statistics and Data Analysis 23 153-168. · Zbl 0900.62337
[57] SINGH, K. 1991. Majority depth. Unpublished manuscript.
[58] SINGH, K. 1998. Breakdown theory for bootstrap quantiles. Ann. Statist. 26 1719-1732. · Zbl 0929.62053
[59] TUKEY, J. 1975. Mathematics and picturing data. In Proceedings of the 1975 International Congress of Mathematics 2 523-531. · Zbl 0347.62002
[60] WEGMAN, E. 1990. Hyperdimensional data analysis using parallel coordinates. J. Amer. Statist. Assoc. 85 664-675.
[61] YEH, A. and SINGH, K. 1997. Balanced confidence sets based on the Tukey depth. J. Roy. Statist. Soc. Ser. B 3 639-652. · Zbl 1090.62539
[62] BECKER, R. A., CLEVELAND, W. S. and WILKS, A. R. 1987. Dynamic graphics for data analysis (with discussion). Statist. Sci. 2 353-395.
[63] MOSTELLER, F. and TUKEY, J. W. 1977. Data Analysis and Regression. Addison-Wesley, Reading, MA.
[64] SCHERVISH, M. J. 1987. Multivariate analysis (with discussion). Statist. Sci. 2 396-433. · Zbl 0955.62590
[65] TUKEY, J. W. 1962. The future of data analysis. Ann. Math. Statist. 33 1-67. Correction: 33 812. · Zbl 0107.36401
[66] TUKEY, J. W. 1977. Exploratory Data Analysis. Addison-Wesley, Reading, MA. · Zbl 0409.62003
[67] $\Sigma = \lambda U C U^{T}$, where $\lambda^{p}$ is the generalized variance, the orthogonal matrix $U$ contains the eigenvectors and $C$ is the diagonal matrix of standardized eigenvalues ($\det C = 1$). As in Bensmail and Celeux (1996), we use the terms scale, shape and orientation for $\lambda$, $C$ and $U$. If $z$ comes from a spherical distribution with the location vector $0$ and covariance matrix $I$, then $y = \lambda^{1/2} U C^{1/2} z + \mu$ is elliptically symmetric with the location vector $\mu$, scale $\lambda$, shape $C$ and orientation $U$. Our plan is to first define a multivariate centered rank vector. This vector, in many ways, represents an extension of the idea of a univariate rank. In addition, it has certain nice affine equivariance properties. We only provide a sketch here; see Hettmansperger, Möttönen and Oja (1998) or Oja (1999) for details. We then consider the rank covariance matrix, RCM. Visuri, Koivunen and Oja (1999) show that if the standardized eigenvalues and the eigenvectors of the covariance matrix are $c_1, \dots, c_p$ and $u_1, \dots, u_p$, respectively, then $c_1^{-1}, \dots, c_p^{-1}$ and $u_1, \dots, u_p$ are the standardized eigenvalues and the eigenvectors for the theoretical RCM. The sample RCM is more robust than the sample covariance matrix and, hence, provides a robust estimate of the underlying shape and orientation for the elliptical distribution. This, along with a robust estimate of Wilks' generalized variance, can be used to robustly estimate $\Sigma$. However, here we use only the standardized eigenvalues and the eigenvectors to define a robust version of depth. We next sketch the construction of the rank vector and corresponding sample RCM. We begin with $p$-dimensional data $x_1, \dots, x_n$. The volume of the $p$-variate simplex determined by $x$ and $p$ observation vectors with indices $i_1, \dots, i_p$ is
[68] , shape C or orientation U. The log scale facilitates comparison of scale near the centers. Compare these plots to Figure 7(a), (b) in the paper. The other nice application discussed by the authors is for the comparison of scatter of the multivariate estimates of location; see Figure 8(a), (b), (c) in the paper. The comparison based on ellipses would be quite natural here since, typically, the estimators will have multivariate normal limiting distributions. Another way to compare scales for two distributions is to look at a PP-plot of the elliptical areas for the two samples. Essentially, it is a plot of the empirical cdf’s of the elliptical areas determined by the data in each sample. Figure 3 shows a PP-scale plot of A versus D. Note that beyond 0.5 the empirical cdf’s of the elliptical areas satisfy $F_A(u) > F_D(u)$, indicating that D has more scatter or larger scale than A. The area under the curve could provide a measure and, hence, in the elliptical case, an asymptotically distribution-free test for scale differences. The test statistic then is the Mann-Whitney-Wilcoxon U-statistic calculated from the depths. In the univariate case, this corresponds to a rank test based on magnitudes of the centered observations. In the comparison in Figure 4, the observed p-value (one-sided test) is 0.22.
[69] BENSMAIL, H. and CELEUX, G. 1996. Regularized Gaussian discriminant analysis through eigenvalue decomposition. J. Amer. Statist. Assoc. 91 1743-1749. · Zbl 0885.62068
[70] HETTMANSPERGER, T. P., MÖTTÖNEN, J. and OJA, H. 1998. Affine invariant multivariate rank tests for several samples. Statist. Sinica 8 785-800. · Zbl 0905.62062
[71] OJA, H. 1999. Affine invariant multivariate sign and rank tests and corresponding estimates: a review. Scand. J. Statist. (invited paper). · Zbl 0938.62063
[72] VISURI, S., KOIVUNEN, V. and OJA, H. 1999. Sign and rank covariance matrices. Conditionally accepted to the J. Statist. Plann. Inference. · Zbl 0965.62049
[73] BECKER, R. A., CLEVELAND, W. S. and WILKS, A. R. 1987. Dynamic graphics for data analysis (with discussion). Statist. Sci. 2 353-395.
[74] CHENG, A., LIU, R. and LUXHOJ, J. 1999. Monitoring multivariate processes: control charts, culpability indices, consistency curves and threshold systems. Preprint.
[75] CHENG, A. and OUYANG, M. 1998. On algorithms for computing simplicial depth. Preprint.
[76] GIL, J., STEIGER, W. and WIGDERSON, A. 1992. Geometric medians. Discrete Math. 108 37-51. · Zbl 0759.68087
[77] JOHNSON, T., KWOK, I. and NG, R. 1998. Fast computation of 2-dimensional depth contours. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
[78] ROUSSEEUW, P. and HUBERT, M. 1999. Regression depth (with discussion). J. Amer. Statist. Assoc. 94 388-433. · Zbl 1007.62060
[79] ROUSSEEUW, P. and RUTS, I. 1996. AS 307: bivariate location depth. Appl. Statist. 45 516-526. · Zbl 0905.62002
[80] ROUSSEEUW, P. and STRUYF, A. 1998. Computing location depth and regression depth in higher dimensions. Statist. Comput. 8 193-203.
[81] SCHERVISH, M. J. 1987. Multivariate analysis (with discussion). Statist. Sci. 2 396-433. · Zbl 0955.62590
[82] SCOTT, D. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York. · Zbl 0850.62006
[83] TENG, J. 1999. New methodology in regression and multivariate quality control via data depth. Ph.D. thesis. Dept. Statistics, Rutgers Univ.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.