## Extending fuzzy and probabilistic clustering to very large data sets.(English)Zbl 1157.62435

Summary: Approximating clusters in very large (VL=unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility-a way to find (approximate) clusters-for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy ($$c$$-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy $$c$$-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.

### MSC:

 62H30 Classification and discrimination; cluster analysis (statistical aspects) 65C60 Computational problems in statistics (MSC2010)
Full Text:

### References:

 [1] Baeza-Yates, R.; Ribeiro-Neto, B., Modern information retrieval, (1999), Addison Wesley Longman Reading, MA [2] Bezdek, J.C.; Dunn, J.C., Optimal fuzzy partitions: a heuristic for estimating the parameters in a mixture of normal distributions, IEEE trans. comput., 24, 8, 835-838, (1975) · Zbl 0308.68078 [3] Bezdek, J.C., Hathaway, R.J., 2002. VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of the IJCNN 2002, IEEE Press, Piscataway, NJ, pp. 2225-2230. [4] Bezdek, J.C.; Hathaway, R.J., Convergence of alternating optimization, Neural, parallel scient. comput., 11, 351-368, (2003) · Zbl 1063.90051 [5] Bezdek, J.C.; Hathaway, R.J.; Huggins, V.J., Parametric estimation for normal mixtures, Pattern recognition lett., 3, 79-84, (1985) · Zbl 0555.62025 [6] Bezdek, J.C.; Li, W.Q.; Attikiouzel, Y.A.; Windham, M.P., A geometric approach to cluster validity for normal mixtures, Soft comput., 1, 166-179, (1997) [7] Bezdek, J.C.; Keller, J.M.; Krishnapuram, R.; Pal, N.R., Fuzzy models and algorithms for pattern recognition and image processing, (1999), Kluwer Norwell · Zbl 0998.68138 [8] Bradley, P., Fayyad, U., Reina, C., 1998a. Scaling clustering algorithms to large databases, In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 9-15. [9] Bradley, P., Fayyad, U., Reina, C., 1998b. Scaling EM (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, Redmond, WA. [10] Cannon, R.L.; Dave, J.V.; Bezdek, J.C., Efficient implementation of the fuzzy $$c$$-means algorithm, IEEE trans. PAMI, 8, 248-255, (1986) · Zbl 0602.68084 [11] Cheng, T.W., Goldgof, D.B., Hall, L.O., 1995. Fast clustering with application to fuzzy rule generation. In: Proceedings of the IEEE International Conference on Fuzzy Systems, Tokyo, Japan, pp. 2289-2295. [12] Cutting, D.R., Karger, D.R., Pederson, J.O., Tukey, J.W., 1992. Scatter/gather: a cluster-based approach to browsing large document collections, In: Proceedings of the ACM SIGIR’92, pp. 318-329. [13] Domingos, P., Hulten, G., 2001. A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning, pp. 106-113. [14] Eschrich, S.; Ke, J.; Hall, L.O.; Goldgof, D.B., Fast accurate fuzzy clustering through data reduction, IEEE trans. fuzzy systems, 11, 262-269, (2003) [15] Farnstrom, F., Lewis, J., Elkan, C., 2000. Scalability for clustering algorithms revisited. SIGKKD Explorations, vol. 2(1). ACM press, New York, pp. 1-7. [16] Fayyad, U., Smyth, P., 1996. From massive data sets to science catalogs: applications and challenges. In: Kettenring, J., Pregibon, D. (Eds.), Proceedings of the Workshop on Massive Data Sets, National Research Council. [17] Fraley, C.; Raftery, A.E., Model-based clustering, discriminant analysis, and density estimation, J. amer. statist. assoc., 97, 458, 611-631, (2002) · Zbl 1073.62545 [18] Ganti, V., Gehrke, J., Ramakrishnan, R., 1999a. Mining very large databases, Computer, August, pp. 38-45. [19] Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.L., French, J.C., 1999b. Clustering large datasets in arbitrary metric spaces, Proceedings of the 15th International Conference on Data Engineering, IEEE CS Press, Los Alamitos, CA, pp. 502-511. [20] Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 73-84. · Zbl 1006.68661 [21] Hathaway, R.J.; Bezdek, J.C., On the asymptotic properties of fuzzy $$c$$-means cluster prototypes as estimators of mixture subpopulation centers, Comm. statist. (A), 15, 2, 505-513, (1986) · Zbl 0608.62072 [22] Hathaway, R.J.; Redner, R.; Bezdek, J.C., Estimating the parameters of mixture models with modal estimators, Comm. statist. (A), 16, 9, 2639-2660, (1987) [23] Huber, P., 1996. Massive Data Sets Workshop: The Morning After, Massive Data Sets. National Academy Press, pp. 169-184. [24] Jain, A.K.; Dubes, R.C., Algorithms for clustering data, (1988), Prentice-Hall Englewood Cliffs, NJ · Zbl 0665.62061 [25] Kolen, J.F.; Hutcheson, T., Reducing the time complexity of the fuzzy $$c$$-means algorithm, IEEE trans. fuzzy systems, 10, 263-267, (2002) [26] McLachlan, G.; Peel, D., Finite mixture models, (2000), Wiley New York · Zbl 0963.62061 [27] Meek, C.; Thiesson, B.; Heckerman, D., The learning curve sampling method applied to model based clustering, J. Mach. learning res., 2, 397-418, (2002) · Zbl 1007.68082 [28] Ng, R.T., Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th International Conference on Very Large Databases, Morgan Kauffman, San Francisco, pp. 144-155. [29] Pal, N.R.; Bezdek, J.C., Complexity reduction for large image processing, IEEE trans. systems, man, cybernet. part B: cybernet., 32, 598-611, (2002) [30] Provost, F., Jensen, D., Oates, T., 1999. Efficient progressive sampling, In: Proceedings of the Fifth KDDM, ACM Press, New York, pp. 23-32. [31] Titterington, D.; Smith, A.; Makov, U., Statistical analysis of finite mixture distributions, (1985), Wiley New York · Zbl 0646.62013 [32] Uma Shankar, B., Pal, N.R., 1994. FFCM: an effective approach for large data sets. In: Proceedings of the Third International Conference on Fuzzy Logic, Neural nets, and Soft Computing, IIZUKA, Fukuoka, Japan, pp. 332-332. [33] Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, New york, pp. 103-114.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.