Global and local two-sample tests via regression. (English) Zbl 1435.62199

The paper proposes global and local tests for determining whether two samples come from different multivariate distributions. Such tests have applications in a variety of machine learning areas, e.g., detecting differences between healthy and cancerous tissue, database attribute matching, and many other classification and regression problems. Under the condition that the two populations differ only in their means, it is proved that the regression test based on Fisher’s LDA achieves the same local optimality as Hotelling’s \(T^2\) test. Simulation studies are carried out to examine the empirical performance of the proposed tests. The tests are also validated on datasets from the Hubble Space Telescope, where the proposed approach is shown to identify galaxies with specific features of star-forming galaxies.
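To make the idea of a regression-based two-sample test concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear least-squares fit of the sample label on the features (a stand-in for the LDA-type regression mentioned above), a global statistic equal to the mean squared deviation of the fitted values from the pooled class proportion, and permutation calibration. The function name `regression_two_sample_test` and its parameters are hypothetical.

```python
import numpy as np


def regression_two_sample_test(x0, x1, n_perm=500, seed=0):
    """Permutation two-sample test based on regressing labels on features.

    Assumption: a linear least-squares regression is used here purely for
    illustration; any regression estimator could be swapped in.
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([x0, x1])
    # Label 0 for the first sample, 1 for the second.
    y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(x)), x])
    # Pooled proportion of label 1; unchanged by permuting the labels.
    pi_hat = y.mean()

    def statistic(labels):
        # Fitted values estimate P(label = 1 | x); under the null hypothesis
        # they should stay close to the constant pi_hat.
        coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
        fitted = design @ coef
        return np.mean((fitted - pi_hat) ** 2)

    observed = statistic(y)
    # Null distribution: recompute the statistic with shuffled labels.
    perm_stats = np.array(
        [statistic(rng.permutation(y)) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, p_value
```

When the two samples differ only by a mean shift, the fitted values separate the groups and the observed statistic exceeds most permuted values, giving a small p-value. A local version would inspect the fitted deviation at individual points rather than its average; that step is not sketched here.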

MSC:

62H15 Hypothesis testing in multivariate analysis
62G10 Nonparametric hypothesis testing
62G20 Asymptotic properties of nonparametric inference
85A15 Galactic and stellar structure
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J05 Linear regression; mixed models
62P35 Applications of statistics to physics
62H35 Image analysis in multivariate analysis

Software:

GeneSrF; hypoRF
