## Selective sampling for approximate clustering of very large data sets.(English)Zbl 1140.68486

Summary: A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of Non-Euclidean Relational Fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from a $$N \times N$$ relational matrix R$$_N$$ to obtain an $$n \times n$$ sample matrix R$$_n$$; (iii) clustering R$$_n$$ with literal NERF; and (iv) extending the clusters in R$$_n$$ to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R$$_n$$, are not very practical because the sample size $$n$$ often turns out to be roughly 50% of $$N$$, and this over-sampling defeats the whole purpose of eNERF.
In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data.

### MSC:

 68T10 Pattern recognition, speech recognition

NERF
Full Text:

### References:

 [1] Massive data workshop: The morning after. In: Massive data sets. New York: National Academy Press; 1996. pp. 169–184. [2] , , Fuzzy models and algorithms for pattern recognition and image processing. Norwell, MA: Kluwer; 1999. · Zbl 0998.68138 [3] Pal, IEEE Trans Systems, Man Cybernet B 32 pp 598– (2002) [4] Hathaway, Comp Stat Anal 51 pp 215– (2006) [5] Bezdek, Int J Intell Syst 21 pp 817– (2006) [6] Hathaway, Pattern Recognit 39 pp 1315– (2006) [7] VAT: A tool for visual assessment of (cluster) tendency. In: Proc Int Joint Conf on Neural Networks; 2002. pp. 2225–2230. [8] Cheng, Fuzzy Sets Syst 93 pp 49– (1998) [9] , Towards scalable support vector machines using squashing. In: Proc ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, 2000. pp. 295–299. [10] , Efficient progressive sampling. In: Proc 5th knowledge discovery and data mining, New York; New York: ACM Press; 1999. pp. 23–32. [11] Meek, J Mach Learn Res 2 pp 397– (2002) [12] , Scaling clustering algorithms to large databases. In: Proc Int Conf on Knowledge Discovery and Data Mining, 1998. pp. 9–15. [13] GenIc: A single pass generalized incremental algorithm for clustering. In: Proc SIAM Int Conf on Data Mining, 2004. [14] A general method for scaling up machine learning algorithms and its application to clustering. In: Proc Int Conf on Machine Learning, 2001. pp. 106–113. [15] , Scaling clustering algorithms to large databases. In: Proc 4th Int Conf Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press; 1998. pp. 9–15. [16] , BIRCH: An efficient data clustering method for very large databases. In: Proc ACM SIGMOD Int Conf on Management of Data. New York: ACM Press; 1996. pp. 103–114. [17] , , , Clustering large data sets in arbitrary metric spaces. In: Proc 15th Int. Conf on Data Engineering. Los Alamitos, CA: IEEE CS Press; 1999. pp. 502–511. [18] Efficient and effective clustering methods for spatial data mining. In: Proc. 20th Int Conf on Very Large Databases. San Francisco, CA: Morgan Kauffman; 1994. pp. 144–155. [19] , CURE: An efficient clustering algorithm for large databases. In: Proc ACM SIGMOD Int Conf. on Management of Data, 1998. pp. 73–84. [20] Hathaway, Pattern Recognit 27 pp 429– (1994) [21] Indices of partition fuzziness and the detection of clusters in large data sets. In: editor. Fuzzy automata and decision processes. New York: Elsevier; 1976. [22] , , Efficiently determining the starting sample size for progressive sampling. In: Proc Euro Conf on Machine Learning, 2001. pp. 192–202. · Zbl 1007.68538
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.