# zbMATH — the first resource for mathematics

Hybrid cluster ensemble framework based on the random combination of data transformation operators. (English) Zbl 1233.68198
Summary: Given a dataset $$P$$ represented by an $$n\times m$$ matrix (where $$n$$ is the number of data points and $$m$$ is the number of attributes), we study how applying transformations to $$P$$ affects the performance of different ensemble algorithms. Specifically, a dataset $$P$$ can be transformed into a new dataset $$P^{\prime}$$ by a set of transformation operators $$\Phi$$ in the instance dimension, such as sub-sampling, super-sampling, and noise injection, and by a corresponding set of transformation operators $$\Psi$$ in the attribute dimension. Based on these conventional transformation operators $$\Phi$$ and $$\Psi$$, a general form $$\Omega$$ of the transformation operator is proposed to represent different kinds of transformation operators. Then, two new data transformation operators, the probabilistic-based data sampling operator and the probabilistic-based attribute sampling operator, are designed to generate new datasets in the ensemble. Next, three new random transformation operators are proposed: the random combination of transformation operators in the data dimension, in the attribute dimension, and in both dimensions. Finally, a new cluster ensemble approach is proposed, which integrates the random combination of data transformation operators across different dimensions, a hybrid clustering technique, a confidence measure, and the normalized cut algorithm into the ensemble framework. The experiments show that (i) the random combination of transformation operators across different dimensions outperforms most of the conventional data transformation operators on different kinds of datasets, and (ii) the proposed cluster ensemble framework performs well on different datasets, such as gene expression datasets and datasets from the UCI machine learning repository.
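
The random combination of operators in both dimensions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the sampling fractions, the noise level, and the uniform choice among operators are all assumptions made for illustration, and the paper's probabilistic-based sampling operators, hybrid clustering, confidence measure, and normalized cut step are not reproduced here. Each call draws one instance-dimension operator (standing in for $$\Phi$$) and one attribute-dimension operator (standing in for $$\Psi$$) at random and applies both to $$P$$, giving one transformed dataset $$P^{\prime}$$ for an ensemble member.

```python
import random

def make_view(P):
    """Wrap an n x m matrix (list of rows) with the original row/column indices."""
    return {"X": [row[:] for row in P],
            "rows": list(range(len(P))),
            "cols": list(range(len(P[0])))}

# --- Instance-dimension operators (illustrative stand-ins for Phi) ---

def subsample_rows(view, rng, frac=0.8):
    """Sub-sampling: keep a random subset of data points, without replacement."""
    n = len(view["rows"])
    keep = sorted(rng.sample(range(n), max(1, int(frac * n))))
    return {"X": [view["X"][i][:] for i in keep],
            "rows": [view["rows"][i] for i in keep],
            "cols": view["cols"][:]}

def supersample_rows(view, rng, frac=1.2):
    """Super-sampling: draw data points with replacement (duplicates allowed)."""
    n = len(view["rows"])
    pick = sorted(rng.choices(range(n), k=max(1, int(frac * n))))
    return {"X": [view["X"][i][:] for i in pick],
            "rows": [view["rows"][i] for i in pick],
            "cols": view["cols"][:]}

def inject_noise(view, rng, sigma=0.05):
    """Noise injection: add zero-mean Gaussian noise to every entry."""
    noisy = [[v + rng.gauss(0.0, sigma) for v in row] for row in view["X"]]
    return {"X": noisy, "rows": view["rows"][:], "cols": view["cols"][:]}

# --- Attribute-dimension operators (illustrative stand-ins for Psi) ---

def subsample_cols(view, rng, frac=0.6):
    """Attribute sub-sampling: keep a random subset of attributes."""
    m = len(view["cols"])
    keep = sorted(rng.sample(range(m), max(1, int(frac * m))))
    return {"X": [[row[j] for j in keep] for row in view["X"]],
            "rows": view["rows"][:],
            "cols": [view["cols"][j] for j in keep]}

def keep_all_cols(view, rng):
    """Identity operator in the attribute dimension."""
    return view

PHI = [subsample_rows, supersample_rows, inject_noise]
PSI = [subsample_cols, keep_all_cols]

def random_combination(P, rng):
    """Apply one randomly chosen instance operator, then one attribute operator."""
    phi = rng.choice(PHI)
    psi = rng.choice(PSI)
    return psi(phi(make_view(P), rng), rng)
```

Keeping the original row and column indices alongside each transformed matrix is a design choice of this sketch: it lets the partitions obtained on different $$P^{\prime}$$ be mapped back to the same data points when they are later combined into a consensus.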

##### MSC:
 68T05 Learning and adaptive systems in artificial intelligence