Local comparison of empirical distributions via nonparametric regression. (English) Zbl 1457.62106

Summary: Given two independent samples of sizes \(n\) and \(m\) drawn from univariate distributions with unknown densities \(f\) and \(g\), respectively, we are interested in identifying subintervals where the two empirical densities deviate significantly from each other. The solution is built by turning the nonparametric density comparison problem into a comparison of two regression curves. Each regression curve is created by binning the original observations into many small bins and then applying a suitable root transformation to the binned counts. Once the problem is recast as a regression comparison, several nonparametric regression procedures for the detection of sparse signals can be applied. Both multiple testing and model selection methods are explored. Furthermore, an approach based on a scale-space representation is derived for estimating larger connected regions where the two empirical densities differ significantly. The proposed methods are applied to simulated examples as well as real-life data from biology.
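
The binning and root-transformation step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the constant 1/4 inside the square root, the per-bin normal approximation, and the use of the Benjamini-Hochberg step-up rule as the multiple testing procedure are all assumptions made here for illustration, and every function name is hypothetical. The wavelet-based, model selection, and scale-space methods mentioned in the summary are not reproduced.

```python
import math
import random


def binned_root_counts(sample, lo, hi, n_bins):
    """Bin a sample on [lo, hi) and return sqrt(count + 1/4) for each bin.

    The offset 1/4 is an assumed variance-stabilising constant; for a
    Poisson-like count the transformed value has variance close to 1/4.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in sample:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return [math.sqrt(c + 0.25) for c in counts]


def local_z_scores(xs, ys, lo, hi, n_bins):
    """Per-bin standardised difference of the two root-transformed curves."""
    n, m = len(xs), len(ys)
    rx = binned_root_counts(xs, lo, hi, n_bins)
    ry = binned_root_counts(ys, lo, hi, n_bins)
    # Dividing by sqrt(sample size) puts both curves on the sqrt(bin mass) scale,
    # where each has approximate variance 1/(4n) and 1/(4m) respectively.
    sd = math.sqrt(0.25 / n + 0.25 / m)
    return [(a / math.sqrt(n) - b / math.sqrt(m)) / sd for a, b in zip(rx, ry)]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    total = len(p_values)
    order = sorted(range(total), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / total:
            k = rank
    return sorted(order[:k])


def differing_bins(xs, ys, lo, hi, n_bins, alpha=0.05):
    """Indices of bins flagged as differing, using two-sided normal p-values."""
    z = local_z_scores(xs, ys, lo, hi, n_bins)
    p = [math.erfc(abs(v) / math.sqrt(2.0)) for v in z]
    return benjamini_hochberg(p, alpha)


if __name__ == "__main__":
    random.seed(0)
    # Simulated example: the second sample carries an extra local bump near 1.5.
    xs = [random.gauss(0.0, 1.0) for _ in range(5000)]
    ys = [random.gauss(0.0, 1.0) for _ in range(4000)]
    ys += [random.gauss(1.5, 0.1) for _ in range(400)]
    print(differing_bins(xs, ys, lo=-3.0, hi=3.0, n_bins=30))
```

The normal approximation behind the z-scores is only reasonable when bin counts are not too small, so the number of bins trades off spatial resolution against the accuracy of the per-bin test.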

MSC:

62G08 Nonparametric regression and quantile regression
62G07 Density estimation
62G10 Nonparametric hypothesis testing
62G20 Asymptotic properties of nonparametric inference

Software:

reccv
