zbMATH — the first resource for mathematics

Multivariate analysis by data depth: Descriptive statistics, graphics and inference. (With discussions and rejoinder). (English) Zbl 0984.62037
Summary: A data depth can be used to measure the “depth” or “outlyingness” of a given multivariate sample with respect to its underlying distribution. This leads to a natural center-outward ordering of the sample points. Based on this ordering, quantitative and graphical methods are introduced for analyzing multivariate distributional characteristics such as location, scale, bias, skewness and kurtosis, as well as for comparing inference methods. All graphs are one-dimensional curves in the plane and can be easily visualized and interpreted.
A “sunburst plot” is presented as a bivariate generalization of the box-plot. DD-(depth versus depth) plots are proposed and examined as graphical inference tools. Some new diagnostic tools for checking multivariate normality are introduced. One of them monitors the exact rate of growth of the maximum deviation from the mean, while the others examine the ratio of the overall dispersion to the dispersion of a certain central region. The affine invariance property of a data depth also leads to appropriate invariance properties for the proposed statistics and methods.
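The center-outward ordering and the DD-plot described in the summary can be illustrated with a minimal sketch. It assumes the Mahalanobis depth, 1 / (1 + d²) with d the Mahalanobis distance to the sample mean, as the depth function; this is only one of several possible depths and is chosen here for brevity, not because the summary singles it out. The function names are hypothetical, and the sketch assumes data of dimension at least two with a nonsingular sample covariance matrix.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`."""
    sample = np.asarray(sample, dtype=float)
    points = np.atleast_2d(np.asarray(points, dtype=float))
    mean = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)  # p x p sample covariance, p >= 2 assumed
    diff = points - mean
    # Squared Mahalanobis distance d^2 = (x - mean)^T cov^{-1} (x - mean).
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 1.0 / (1.0 + d2)

def center_outward_order(sample):
    """Indices of the sample points sorted from deepest to most outlying."""
    depths = mahalanobis_depth(sample, sample)
    return np.argsort(-depths, kind="stable")

def dd_plot_coordinates(sample_x, sample_y):
    """For each pooled point: (depth w.r.t. sample_x, depth w.r.t. sample_y)."""
    pooled = np.vstack([sample_x, sample_y])
    return np.column_stack([
        mahalanobis_depth(pooled, sample_x),
        mahalanobis_depth(pooled, sample_y),
    ])
```

Plotting the two columns returned by `dd_plot_coordinates` against each other gives a DD-plot for this choice of depth. Because the Mahalanobis depth is affine invariant, the ordering returned by `center_outward_order` does not change when the data are transformed by any nonsingular affine map.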

62H05 Characterization and structure theory for multivariate probability distributions; copulas
62-09 Graphical methods in statistics (MSC2010)
62-07 Data analysis (statistics) (MSC2010)
62J20 Diagnostics, and linear inference and regression
Software: AS 307
HILL CENTER NEW YORK, NEW YORK 10036 RUTGERS UNIVERSITY PISCATAWAY, NEW JERSEY 08854-8019 E-MAIL: rliu@stat.rutgers.edu kern@stat.rutgers.edu
\(\Sigma = \lambda U C U^{T}\), where \(\lambda^{p}\) is the generalized variance, the orthogonal matrix \(U\) contains the eigenvectors and \(C\) is the diagonal matrix of standardized eigenvalues (\(\det(C) = 1\)). As in Bensmail and Celeux (1996), we use the terms scale, shape and orientation for items \(\lambda\), \(C\) and \(U\). If \(z\) comes from a spherical distribution with the location vector \(0\) and covariance matrix \(I\), then \(y = \lambda^{1/2} U C^{1/2} z + \mu\) is elliptically symmetric with the location vector \(\mu\), scale \(\lambda\), shape \(C\) and orientation \(U\). Our plan is to first define a multivariate centered rank vector. This vector, in many ways, represents an extension of the idea of a univariate rank. In addition, it has certain nice affine equivariance properties. We only provide a sketch here; see Hettmansperger, Möttönen and Oja (1998) or Oja (1999) for details. We then consider the rank covariance matrix, RCM. Visuri, Koivunen and Oja (1999) show that if the standardized eigenvalues and the eigenvectors of the covariance matrix are \(c_1 \ge \cdots \ge c_p\) and \(u_1, \dots, u_p\), respectively, then \(c_1^{-1} \le \cdots \le c_p^{-1}\) and \(u_1, \dots, u_p\) are the standardized eigenvalues and the eigenvectors for the theoretical RCM. The sample RCM is more robust than the sample covariance matrix and, hence, provides a robust estimate of the underlying shape and orientation for the elliptical distribution. This, along with a robust estimate of Wilks's generalized variance, can be used to robustly estimate \(\Sigma\). However, here we use only the standardized eigenvalues and the eigenvectors to define a robust version of depth. We next sketch the construction of the rank vector and corresponding sample RCM. We begin with \(p\)-dimensional data \(x_1, \dots, x_n\). The volume of the \(p\)-variate simplex determined by \(x\) and \(p\) observation vectors with indices \(i_1 < \cdots < i_p\) is
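The scale, shape and orientation decomposition of a covariance matrix described above can be sketched numerically. This is an illustration under stated assumptions, not the discussants' code: the scale factor is written as `lam` and taken to be the p-th root of the determinant, so that the shape matrix has determinant 1, and the covariance matrix is assumed to be symmetric positive definite. The function names are hypothetical.

```python
import numpy as np

def scale_shape_orientation(cov):
    """Split a positive definite covariance matrix into (lam, C, U).

    cov = lam * U @ C @ U.T, with U orthogonal (eigenvectors in columns)
    and C diagonal with det(C) = 1 (standardized eigenvalues).
    """
    cov = np.asarray(cov, dtype=float)
    p = cov.shape[0]
    eigvals, U = np.linalg.eigh(cov)       # eigenvalues in ascending order
    lam = np.prod(eigvals) ** (1.0 / p)    # det(cov)^(1/p), so lam**p = det(cov)
    C = np.diag(eigvals / lam)             # standardized eigenvalues
    return lam, C, U

def elliptical_from_spherical(z, mu, lam, C, U):
    """Map rows of spherical data z to y = lam^(1/2) U C^(1/2) z + mu."""
    A = np.sqrt(lam) * U @ np.sqrt(C)      # elementwise sqrt is exact for diagonal C
    return np.asarray(z, dtype=float) @ A.T + np.asarray(mu, dtype=float)
```

Under this parametrization, inverting the standardized eigenvalues (as stated for the theoretical RCM) again yields a diagonal matrix with determinant 1, so shape and orientation remain recoverable from the eigen-decomposition alone.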
, shape \(C\) or orientation \(U\). The log scale facilitates comparison of scale near the centers. Compare these plots to Figure 7(a), (b) in the paper. The other nice application discussed by the authors is for the comparison of scatter of the multivariate estimates of location; see Figure 8(a), (b), (c) in the paper. The comparison based on ellipses would be quite natural here since, typically, the estimators will have multivariate normal limiting distributions. Another way to compare scales for two distributions is to look at a PP-plot of the elliptical areas for the two samples. Essentially, it is a plot of the empirical cdf's of the elliptical areas determined by the data in each sample. Figure 3 shows a PP-scale plot of A versus D, indicating that D has more scatter or larger scale than A. The area under the curve could provide a measure and, hence, in the elliptical case, an asymptotically distribution-free test for scale differences. The test statistic then is the Mann-Whitney-Wilcoxon U-statistic calculated from the depths. In the univariate case, this corresponds to a rank test based on magnitudes of the centered observations. In the comparison in Figure 4, the observed p-value (one-sided test) is 0.22.
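A Mann-Whitney-Wilcoxon comparison of two sets of scalar values (depths or elliptical areas, one value per observation) can be sketched as follows. This is a generic illustration rather than the discussants' exact procedure: the function names are hypothetical, the p-value uses the large-sample normal approximation without a tie correction, and the alternative is taken to be that the second sample tends to have larger values.

```python
import math

def mann_whitney_u(values_a, values_b):
    """U = #{(a, b): a < b} + 0.5 * #{(a, b): a == b}."""
    u = 0.0
    for a in values_a:
        for b in values_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def one_sided_p_value(values_a, values_b):
    """Normal-approximation p-value for 'values_b tends to be larger'."""
    m, n = len(values_a), len(values_b)
    u = mann_whitney_u(values_a, values_b)
    mean = m * n / 2.0                              # E[U] under the null
    sd = math.sqrt(m * n * (m + n + 1) / 12.0)      # sd of U under the null, no ties
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))      # upper-tail normal probability
```

For small samples an exact permutation distribution of U would be preferable to the normal approximation used in this sketch.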
