
Relative density-ratio estimation for robust distribution comparison. (English) Zbl 1414.62115

Summary: Divergence estimators based on direct approximation of density ratios, without separately approximating the numerator and denominator densities, have been successfully applied to machine learning tasks that involve distribution comparison, such as outlier detection, transfer learning, and two-sample homogeneity testing. However, since density-ratio functions often fluctuate strongly, divergence estimation is a challenging task in practice. In this letter, we use relative divergences for distribution comparison, which involve the approximation of relative density ratios. Since relative density ratios are always smoother than the corresponding ordinary density ratios, the proposed method is favorable in terms of the nonparametric convergence rate. Furthermore, we show that the proposed divergence estimator has an asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.
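The summary leaves the central quantities implicit, so the following sketch states them explicitly. The definitions below follow the standard formulation of relative density-ratio estimation; the symbols p, q, and α are notational assumptions made here for exposition, since the original paper's notation is not reproduced in this record.

```latex
% alpha-relative density ratio of p to q, for a mixing parameter 0 <= alpha < 1:
\[
  r_\alpha(x) \;=\; \frac{p(x)}{\alpha\,p(x) + (1-\alpha)\,q(x)},
  \qquad 0 \le \alpha < 1 .
\]
% For alpha > 0 this ratio is bounded above by 1/alpha, which is why it is
% always smoother than the ordinary ratio p/q (the alpha = 0 case).
% The alpha-relative Pearson divergence used for distribution comparison:
\[
  \mathrm{PE}_\alpha(p\,\|\,q)
  \;=\; \tfrac{1}{2}\,\mathbb{E}_{q_\alpha}\!\left[(r_\alpha(x)-1)^2\right],
  \qquad q_\alpha = \alpha\,p + (1-\alpha)\,q .
\]
```

For concreteness, here is a minimal Python sketch of a least-squares fit of r_α under a Gaussian-kernel linear model, in the spirit of the direct least-squares density-ratio estimators the summary refers to. Everything concrete below is an illustrative assumption rather than the authors' implementation: the function name `fit_relative_ratio`, the choice of kernel centres, and the default `sigma` and `lam` (in practice both would be chosen by cross-validation).

```python
import numpy as np

def fit_relative_ratio(x_p, x_q, alpha=0.1, sigma=1.0, lam=0.1):
    """Least-squares fit of the alpha-relative density ratio
    r_alpha(x) = p(x) / (alpha*p(x) + (1 - alpha)*q(x))
    using Gaussian kernels centred on the samples from p.
    Illustrative sketch: sigma and lam would normally be cross-validated.
    """
    centers = x_p  # kernel centres; one common, simple choice

    def design(x):
        # Gaussian kernel design matrix of shape (len(x), len(centers))
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    phi_p, phi_q = design(x_p), design(x_q)
    # H approximates alpha*E_p[phi phi^T] + (1 - alpha)*E_q[phi phi^T]
    H = (alpha * phi_p.T @ phi_p / len(x_p)
         + (1.0 - alpha) * phi_q.T @ phi_q / len(x_q))
    h = phi_p.mean(axis=0)  # approximates E_p[phi]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: design(x) @ theta  # x -> estimated r_alpha(x)

# Toy comparison of N(0, 1) against N(0.5, 1)
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(200, 1))
x_q = rng.normal(0.5, 1.0, size=(200, 1))

alpha = 0.1
r_hat = fit_relative_ratio(x_p, x_q, alpha=alpha)
rp, rq = r_hat(x_p), r_hat(x_q)
# Plug-in estimate of the alpha-relative Pearson divergence
pe_hat = (rp.mean()
          - 0.5 * alpha * (rp ** 2).mean()
          - 0.5 * (1.0 - alpha) * (rq ** 2).mean()
          - 0.5)
print(f"estimated PE_alpha: {pe_hat:.4f}")
```

The final lines compute a plug-in estimate of PE_α from the fitted ratio via the expansion E_p[r̂] − (α/2) E_p[r̂²] − ((1−α)/2) E_q[r̂²] − 1/2; again, this is a sketch of the technique under the stated assumptions, not the paper's reference implementation.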

MSC:

62G05 Nonparametric estimation

Software:

LIBSVM; bootstrap
