Information-maximization clustering based on squared-loss mutual information. (English) Zbl 1418.62257
Summary: Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it involves only continuous optimization of model parameters, which is substantially simpler than discrete optimization of cluster assignments. However, existing methods still involve nonconvex optimization problems, and therefore finding a good local optimum is not straightforward in practice. In this letter, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives the clustering solution analytically, in a computationally efficient way, via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
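
To make the summary's central claim concrete, that the clustering solution follows analytically from a kernel eigenvalue decomposition, here is a minimal Python sketch in the spirit of the method. It builds a Gaussian kernel matrix over the samples, takes the top-c eigenvectors, and reads cluster assignments off the sign-fixed, clipped, and column-normalized eigenvector entries. The kernel choice, the width sigma, and the post-processing below are illustrative assumptions rather than the authors' exact SMIC procedure; in particular, the objective selection of kernel tuning parameters via a squared-loss mutual information estimate, which the letter emphasizes, is omitted here.

    import numpy as np

    def smi_clustering_sketch(X, c, sigma=1.0):
        """Assign the n rows of X to c clusters using the top-c
        eigenvectors of a Gaussian kernel matrix (illustrative only)."""
        # Gaussian kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq_dists / (2.0 * sigma ** 2))

        # One symmetric eigendecomposition gives the whole solution;
        # np.linalg.eigh sorts eigenvalues ascending, so reverse columns.
        _, eigvecs = np.linalg.eigh(K)
        Psi = eigvecs[:, ::-1][:, :c]

        # Fix each eigenvector's sign so its entries sum to a nonnegative
        # value, clip negative entries, and normalize each column so no
        # single cluster dominates the argmax below.
        Psi = Psi * np.sign(Psi.sum(axis=0, keepdims=True) + 1e-12)
        scores = np.maximum(Psi, 0.0)
        scores = scores / (scores.sum(axis=0, keepdims=True) + 1e-12)

        # Each sample joins the cluster with the largest normalized score.
        return scores.argmax(axis=1)

    # Toy usage: two well-separated blobs should be recovered.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                   rng.normal(3.0, 0.3, (50, 2))])
    labels = smi_clustering_sketch(X, c=2, sigma=1.0)

For well-separated clusters, the leading kernel eigenvectors are approximately indicator functions of the clusters, so the argmax recovers the partition; the computational appeal is that everything reduces to a single eigendecomposition, with no iterative nonconvex optimization.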

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
Software:
DIFFRAC
References:
[1] Agakov, F., & Barber, D. (2006). Kernelized infomax clustering. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 17-24). Cambridge, MA: MIT Press.
[2] Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131-142. · Zbl 0203.19902
[3] Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75, 245-249. · Zbl 1378.68047
[4] Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299-307. · Zbl 0189.50101
[5] Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5-43. · Zbl 1033.68081
[6] Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2, 1152-1174. · Zbl 0335.60034
[7] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337-404. · Zbl 0037.20701
[8] Attias, H. (2000). A variational Bayesian framework for graphical models. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 209-215). Cambridge, MA: MIT Press.
[9] Bach, F., & Harchaoui, Z. (2008). DIFFRAC: A discriminative and flexible framework for clustering. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 49-56). Cambridge, MA: MIT Press.
[10] Bach, F., & Jordan, M. I. (2006). Learning spectral clustering, with application to speech separation. Journal of Machine Learning Research, 7, 1963-2001. · Zbl 1222.68138
[11] Bao, L., & Intille, S. S. (2004). Activity recognition from user-annotated acceleration data. In Proceedings of 2nd IEEE International Conference on Pervasive Computing (pp. 1-17). Piscataway, NJ: IEEE.
[12] Bharatula, N. B., Stäger, M., Lukowicz, P., & Tröster, G. (2005). Empirical study of design choices in multi-sensor context recognition. In Proceedings of International Forum on Applied Wearable Computing (pp. 79-93). Berlin: VDE Verlag.
[13] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer. · Zbl 1107.68072
[14] Blei, D. M., & Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1, 121-144. · Zbl 1331.62259
[15] Carreira-Perpiñán, M. A. (2006). Fast nonparametric clustering with Gaussian blurring mean-shift. In Proceedings of 23rd International Conference on Machine Learning (ICML2006) (pp. 153-160). Madison, WI: Omnipress.
[16] Carreira-Perpiñán, M. A. (2007). Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 767-776.
[17] Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 790-799.
[18] Chung, F. R. K. (1997). Spectral graph theory. Providence, RI: American Mathematical Society. · Zbl 0867.05046
[19] Cour, T., Gogin, N., & Shi, J. (2005). Learning spectral graph segmentation. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 65-72). Society for Artificial Intelligence and Statistics.
[20] Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: Wiley. · Zbl 1140.94001
[21] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299-318. · Zbl 0157.25802
[22] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. · Zbl 0364.62022
[23] Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means, spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 551-556). New York: ACM Press.
[24] Ding, C., & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML2004) (pp. 225-232). New York: ACM Press.
[25] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley. · Zbl 0968.68140
[26] Duffy, N., & Collins, M. (2002). Convolution kernels for natural language. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 625-632). Cambridge, MA: MIT Press.
[27] Faivishevsky, L., & Goldberger, J. (2010). A nonparametric information theoretic clustering algorithm. In Proceedings of 27th International Conference on Machine Learning (ICML2010) (pp. 351-358). Madison, WI: Omnipress.
[28] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209-230. · Zbl 0255.62037
[29] Fukunaga, K., & Hostetler, L. D. (1975). The estimation of the gradient of a density function, with application in pattern recognition. IEEE Transactions on Information Theory, 21, 32-40. · Zbl 0297.62025
[30] Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations, 5, 49-58.
[31] Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory (pp. 129-143). New York: Springer. · Zbl 1274.68312
[32] Ghahramani, Z., & Beal, M. J. (2000). Variational inference for Bayesian mixtures of factor analysers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 449-455). Cambridge, MA: MIT Press.
[33] Girolami, M. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13, 780-784.
[34] Golub, G. H., & Van Loan, C. F. (1989). Matrix computations (2nd ed.). Baltimore, MD: Johns Hopkins University Press. · Zbl 0733.65016
[35] Gomes, R., Krause, A., & Perona, P. (2010). Discriminative clustering by regularized information maximization. In J. Lafferty (Ed.), Advances in neural information processing systems, 23 (pp. 766-774). Red Hook, NY: Curran.
[36] Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, & E. Tomita (Eds.), Algorithmic learning theory (pp. 63-77). Berlin: Springer-Verlag. · Zbl 1168.62354
[37] Hachiya, H., Sugiyama, M., & Ueda, N. (2012). Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80, 93-101.
[38] Härdle, W., Müller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and semiparametric models. Berlin: Springer. · Zbl 1059.62032
[39] Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press. · Zbl 0576.15001
[40] Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218. · Zbl 0587.62128
[41] Joachims, T. (2002). Learning to classify text using support vector machines: Methods, theory and algorithms. Boston: Kluwer.
[42] Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag. · Zbl 1011.62064
[43] Kain, A., & Macon, M. W. (1998). Spectral voice conversion for text-to-speech synthesis. In Proceedings of 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP1998) (pp. 285-288). Piscataway, NJ: IEEE.
[44] Kashima, H., & Koyanagi, T. (2002). Kernels for semi-structured data. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 291-298). San Francisco: Morgan Kaufmann.
[45] Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the Twentieth International Conference on Machine Learning (pp. 321-328). San Francisco: Morgan Kaufmann.
[46] Koltchinskii, V. (1998). Asymptotics of spectral projections of some random matrices approximating integral operators. In D. Khoshnevisan, A. Kyprianov, & R. Sidney (Eds.), Progress in probability (pp. 191-227). New York: Springer. · Zbl 0905.60003
[47] Koltchinskii, V., & Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6, 113-167. · Zbl 0949.60078
[48] Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 315-322). San Francisco: Morgan Kaufmann.
[49] Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of entropy of a random vector. Problems of Information Transmission, 23, 95-101. · Zbl 0633.62005
[50] Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79-86. · Zbl 0042.38403
[51] Kurihara, K., & Welling, M. (2009). Bayesian k-means as a “maximization-expectation” algorithm. Neural Computation, 21, 1145-1172. · Zbl 1178.68425
[52] Lee, Y. K., & Ng, H. T. (2002). An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of Conference on Empirical Methods in Natural Language Processing (pp. 41-48). Stroudsburg, PA: Association for Computational Linguistics.
[53] Li, Y. F., Tsang, I. W., Kwok, J. T., & Zhou, Z.-H. (2009). Tighter and convex maximum margin clustering. In Proceedings of Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS2009) (pp. 344-351).
[54] Lin, D., Grimson, E., & Fisher, J. (2010). Construction of dependent Dirichlet processes based on Poisson processes. In J. Lafferty (Ed.), Advances in neural information processing systems, 23 (pp. 1387-1395). Red Hook, NY: Curran.
[55] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419-444. · Zbl 1013.68176
[56] MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press. · Zbl 1055.94001
[57] MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). Berkeley: University of California Press. · Zbl 0214.46201
[58] Meila, M., & Shi, J. (2001). Learning segmentation by random walks. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 873-879). Cambridge, MA: MIT Press.
[59] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249-265.
[60] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 849-856). Cambridge, MA: MIT Press.
[61] Niu, G., Dai, B., Shang, L., & Sugiyama, M. (in press). Maximum volume clustering: A new discriminative clustering approach. Journal of Machine Learning Research. · Zbl 1318.62213
[62] Niu, Z.-Y., Ji, D.-H., & Tan, C. L. (2005). A semi-supervised feature clustering algorithm with application to word sense disambiguation. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 907-914). Stroudsburg, PA: Association for Computational Linguistics.
[63] Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50, 157-175. · JFM 31.0238.04
[64] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559-572. · JFM 32.0246.07
[65] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.
[66] Rodríguez, A., Dunson, D. B., & Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103, 1131-1154. · Zbl 1205.62062
[67] Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press. · Zbl 1019.68094
[68] Shental, N., Zomet, A., Hertz, T., & Weiss, Y. (2003). Learning and inferring image segmentations using the GBP typical cut algorithm. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1243-1250). Piscataway, NJ: IEEE.
[69] Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.
[70] Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall. · Zbl 0617.62042
[71] Song, L., Smola, A., Gretton, A., & Borgwardt, K. (2007). A dependence maximization view of clustering. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007) (pp. 815-822). New York: ACM.
[72] Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 1027-1061. · Zbl 1222.68312
[73] Sugiyama, M. (2013). Machine learning with squared-loss mutual information. Entropy, 15, 80-112. · Zbl 1371.68241
[74] Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge: Cambridge University Press. · Zbl 1274.62037
[75] Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699-746. · Zbl 1294.62069
[76] Sugiyama, M., Yamada, M., Kimura, M., & Hachiya, H. (2011). On information-maximization clustering: Tuning parameter selection and analytic solution. In Proceedings of 28th International Conference on Machine Learning (ICML2011) (pp. 65-72). Madison, WI: Omnipress.
[77] Suzuki, T., Sugiyama, M., Kanamori, T., & Sese, J. (2009). Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10, S52.
[78] Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. In Proceedings of the ECML-PKDD2008 Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery 2008 (FSDM2008) (pp. 5-20). Antwerp, Belgium.
[79] Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581. · Zbl 1171.62349
[80] Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109-2128.
[81] Valizadegan, H., & Jin, R. (2007). Generalized maximum margin clustering and unsupervised kernel learning. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 1417-1424). Cambridge, MA: MIT Press.
[82] Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag. · Zbl 0833.62008
[83] von Luxburg, U. (2004). Statistical learning with similarity and dissimilarity functions. Doctoral dissertation, Technical University of Berlin, Berlin, Germany.
[84] Wang, F., Zhao, B., & Zhang, C. (2010). Linear time maximum margin clustering. IEEE Transactions on Neural Networks, 21, 319-332.
[85] Xu, L., Neufeld, J., Larson, B., & Schuurmans, D. (2005). Maximum margin clustering. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1537-1544). Cambridge, MA: MIT Press.
[86] Yang, W.-Y., Kwok, J. T., & Lu, B.-L. (2010). Spectral and semidefinite relaxation of the CLUHSIC algorithm. In Proceedings of the 2010 SIAM International Conference on Data Mining (pp. 106-117). Philadelphia: SIAM.
[87] Zelnik-Manor, L., & Perona, P. (2005). Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1601-1608). Cambridge, MA: MIT Press.
[88] Zha, H., He, X., Ding, C., Gu, M., & Simon, H. (2002). Spectral relaxation for k-means clustering. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1057-1064). Cambridge, MA: MIT Press.
[89] Zhang, K., Tsang, I. W., & Kwok, J. T. (2009). Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 20, 583-596.
[90] Zhao, B., Wang, F., & Zhang, C. (2008). Maximum margin clustering via cutting plane algorithm. In Proceedings of the 2008 SIAM International Conference on Data Mining (pp. 751-762). Philadelphia: SIAM.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.