Machine learning with squared-loss mutual information. (English) Zbl 1371.68241
Summary: Mutual information (MI) is useful for detecting statistical independence between random variables, and it has been successfully applied to solving various machine learning problems. Recently, an alternative to MI called squared-loss MI (SMI) was introduced. While ordinary MI is the Kullback-Leibler divergence from the joint distribution to the product of the marginal distributions, SMI is its Pearson divergence variant. Because both divergences belong to the \(f\)-divergence family, they share similar theoretical properties. However, a notable advantage of SMI is that it can be approximated from data in a computationally more efficient and numerically more stable way than ordinary MI. In this article, we review recent developments in SMI approximation based on direct density-ratio estimation, as well as SMI-based machine learning techniques such as independence testing, dimensionality reduction, canonical dependency analysis, independent component analysis, object matching, clustering, and causal inference.
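To make the comparison in the summary concrete, the two quantities can be written out as follows; this is the standard formulation from the SMI literature, and the density-ratio notation \(r(x,y)\) and the basis expansion used for the estimator are introduced here for illustration rather than quoted from the paper:
\[
\mathrm{MI}(X,Y)=\iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,\mathrm{d}x\,\mathrm{d}y,
\qquad
\mathrm{SMI}(X,Y)=\frac{1}{2}\iint p(x)\,p(y)\,\bigl(r(x,y)-1\bigr)^{2}\,\mathrm{d}x\,\mathrm{d}y,
\]
where \(r(x,y)=p(x,y)/\bigl(p(x)\,p(y)\bigr)\) is the density ratio. In a typical least-squares (LSMI-style) approximation, one fits a linear-in-parameters model \(r_{\theta}(x,y)=\sum_{\ell=1}^{b}\theta_{\ell}\,\varphi_{\ell}(x,y)\) to \(r\) under the squared loss weighted by \(p(x)\,p(y)\). With the empirical quantities
\[
\widehat{H}_{\ell\ell'}=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\varphi_{\ell}(x_i,y_j)\,\varphi_{\ell'}(x_i,y_j),
\qquad
\widehat{h}_{\ell}=\frac{1}{n}\sum_{i=1}^{n}\varphi_{\ell}(x_i,y_i),
\]
the \(\ell_2\)-regularized solution is \(\widehat{\theta}=(\widehat{H}+\lambda I)^{-1}\widehat{h}\), and one common plug-in estimate is \(\widehat{\mathrm{SMI}}=\tfrac{1}{2}\widehat{h}^{\top}\widehat{\theta}-\tfrac{1}{2}\). The closed-form solution is the source of the computational and numerical advantages over ordinary MI estimation noted in the summary.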

MSC:
68T05 Learning and adaptive systems in artificial intelligence
62B10 Statistical aspects of information-theoretic topics
62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Software:
DIFFRAC