×

Generalized density clustering. (English) Zbl 1200.62066

Summary: We study generalized density-based clustering in which sharply defined clusters such as clusters on lower-dimensional manifolds are allowed. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth and we study the stability properties of density clusters. We show that a simple graph-based algorithm successfully approximates the high density clusters.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
PDF BibTeX XML Cite
Full Text: DOI arXiv

References:

[1] Ambrosio, L., Fusco, N. and Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems . Clarendon, Oxford. · Zbl 0957.49001
[2] Audibert, J. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608-633. · Zbl 1118.62041
[3] Baíllo, A., Cuesta-Albertos, J. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statist. Probab. Lett. 53 27-35. · Zbl 0980.62022
[4] Ben-David, S., von Luxburg, U. and Pall, D. (2006). A sober look at clustering stability. In Learning Theory. Lecture Notes in Computer Science 4005 5-19. Springer, Berlin. · Zbl 1143.68520
[5] Ben-Hur, A., Elisseeff, A. and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing (R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale and T. E. Klein, eds.) 6-17. World Scientific, New York.
[6] Biau, G., Cadre, B. and Pellettier, B. (2007). A graph-based estimator of the number of clusters. ESAIM Probab. Stat. 11 272-280. · Zbl 1187.62114
[7] Cadre, B. (2006). Kernel density estimation on level sets. J. Multivariate Anal. 97 999-1023. · Zbl 1085.62039
[8] Castro, R. and Nowak, R. (2008). Minimax bounds for active learning. IEEE Trans. Inform. Theory 5 2339-2353. · Zbl 1330.68246
[9] Cormen, T., Leiserson, C., Rivest, R. and Stein, C. (2002). Introduction to Algorithms . McGraw-Hill, New York. · Zbl 1187.68679
[10] Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the number of clusters. Canad. J. Statist. 28 . JSTOR: · Zbl 0981.62054
[11] Cuevas, A. and Fraiman, R. (1997). A plug-in approach to support estimation. Ann. Statist. 25 2300-2312. · Zbl 0897.62034
[12] Cuevas, A., Gonzàlez-Manteiga, W. and Rodrìguez-Casal, A. (2006). Plug-in estimation of general level sets. Aust. N. Z. J. Stat. 48 7-19. · Zbl 1108.62036
[13] Devroye, L., Györfi, L. and Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition . Springer, New York. · Zbl 0853.68150
[14] Devroye, L. and Wise, L. (1980). Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38 480-488. JSTOR: · Zbl 0479.62028
[15] Einmahl, U. and Mason, D. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33 1380-1403. · Zbl 1079.62040
[16] Evans, L. and Gariepy, R. (1992). Measure Theory and Fine Properties of Functions . CRC Press, Boca Raton, FL. · Zbl 0804.28001
[17] Falconer, K. (2003). Fractal Geometry: Mathematical Foundations and Applications . Wiley, Hoboken, NJ. · Zbl 1060.28005
[18] Federer, H. (1969). Geometric Measure Theory . Springer, New York. · Zbl 0176.00801
[19] Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38 907-921. · Zbl 1011.62034
[20] Giné, E. and Koltchinskii, V. (2006). Empirical graph laplacian approximation of laplace-beltrami operators: Large sample results. In High Dimensional Probability (E. Giné, V. Koltchinskii, W. Li and J. Zinn, eds.) 238-259. · Zbl 1124.60030
[21] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression . Springer, New York. · Zbl 1021.62024
[22] Hartigan, J. (1975). Clustering Algorithms . Wiley, New York. · Zbl 0372.62040
[23] Jang, W. and Hendry, M. (2007). Cluster analysis of massive datasets in astronomy. Stat. Comput. 17 253-262.
[24] Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image Reconstruction . Springer, New York. · Zbl 0833.62039
[25] Lange, T., Roth, V., Braun, M. and J.Buhmann (2004). Stability-based validation of clustering solutions. Neural Computation 16 1299-1323. · Zbl 1089.68100
[26] Lee, J. (2003). Introduction to Smooth Manifolds . Springer, New York. · Zbl 1030.53001
[27] Leoni, G. and Fonseca, I. (2007). Modern Methods in the Calculus of Variations: L p Spaces . Springer, New York. · Zbl 1153.49001
[28] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808-1829. · Zbl 0961.62058
[29] Massart, P. (2000). About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab. 28 863-884. · Zbl 1140.60310
[30] Mattila, P. (1999). Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability . Cambridge Univ. Press, Cambridge. · Zbl 0911.28005
[31] Mueller, D. and Sawitzki, G. (1991). Excess mass estimates and test for multimodality. J. Amer. Statist. Assoc. 86 738-746. JSTOR: · Zbl 0733.62040
[32] Ng, A., Jordan, M. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14 849-856.
[33] Niyogi, P., Smale, S. and Weinberger, S. (2008). Finding the homology of submanifolds with high confidence. Discrete and Compuational Geometry 38 419-441. · Zbl 1148.68048
[34] Nolan, D. and Pollard, D. (1987). U-processes: Rates of convergence. Ann. Statist. 15 780-799. · Zbl 0624.60048
[35] Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters-an excess mass approach. Ann. Statist. 23 855-881. · Zbl 0841.62045
[36] Rigollet, P. (2007). Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res. 8 1369-1392. · Zbl 1222.68288
[37] Rigollet, P. and Vert, R. (2006). Fast rates for plug-in estimators of density level sets. Available at . · Zbl 1200.62034
[38] Scott, C. and Nowak, R. (2006). Learning minimum volume sets. J. Mach. Learn. Res. 7 665-704. · Zbl 1222.68300
[39] Singh, A., Nowak, R. and Zhu, X. (2008). Unlabeled data: Now it helps, now it doesn’t. In Neural Information Processing Systems .
[40] Singh, A., Scott, C. and Nowak, R. (2009). Adaptive hausdorff estimation of density level sets. Ann. Statist. 37 2760-2782. · Zbl 1173.62019
[41] Steinwart, I., Hush, D. and Scovel, C. (2005). A classification for anomaly detection. J. Mach. Learn. Res. 6 211-232. · Zbl 1222.68309
[42] Stuetzle, W. and Nugent, R. (2010). A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Statist. 19 397-418.
[43] Tsybakov, A. (1997). On nonparametric eatimaiton of density level sets. Ann. Statist. 25 948-969. · Zbl 0881.62039
[44] Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135-166. · Zbl 1105.62353
[45] van der Vaart, A. and Wellner, J. (1996). Weak Convergence of Empirical Processes . Springer, New York. · Zbl 0862.60002
[46] von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17 395-416.
[47] Willett, R. and Nowak, R. (2007). Minimax optimal level-set estimation. IEEE Trans. Image Process. 12 2965-2979. · Zbl 05516503
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.