zbMATH — the first resource for mathematics

Semiparametric density testing in the contamination model. (English) Zbl 1434.62048
Summary: In this paper we investigate a semiparametric testing approach to determine whether the parametric family assigned to the unknown density of a two-component mixture model with one known component is correct. Based on a semiparametric estimation of the Euclidean parameters of the model (free from the null assumption), our method compares pairwise the Fourier-type coefficients of the model estimated directly from the data with those obtained by plugging the estimated parameters into the mixture model. These comparisons are incorporated into a sum-of-squares type statistic whose order is controlled by a penalization rule. We prove under mild conditions that our test statistic is asymptotically \(\chi^2_1\)-distributed and study its behavior, both numerically and theoretically, under different types of alternatives, including contiguous nonparametric alternatives. We discuss the lack of power, counterintuitive from a practitioner's point of view, of the maximum likelihood version of our test in a neighborhood of challenging non-identifiable situations. Several numerical level and power studies are conducted on models close to those considered in the literature, such as in G. J. McLachlan et al. [“A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays”, Bioinform. 22, No. 13, 1608–1615 (2006; doi:10.1093/bioinformatics/btl148)], to validate the suitability of our approach. We also apply our testing procedure to the real dataset of the Carina galaxy, whose low luminosity mixes with that of its companion, the Milky Way. Finally, we discuss possible extensions of our work to a wider class of contamination models.
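The construction described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions that do not come from the summary: the basis is a cosine system composed with the standard normal distribution function, the mixing proportion `p_hat` and the fitted density `fitted_pdf` are supplied by the caller (the semiparametric estimation step is not shown), each coefficient difference is standardized by a plain empirical variance that ignores the effect of parameter estimation, and the order is selected with a Schwarz-type penalty `k * log(n)`. All function and parameter names are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)


def std_normal_cdf(x):
    """Standard normal distribution function, evaluated elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2), evaluated elementwise."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * math.sqrt(2.0 * math.pi))


def basis(k, x):
    """k-th cosine function of Phi(x); orthonormal w.r.t. N(0, 1) (assumed basis)."""
    return math.sqrt(2.0) * np.cos(k * math.pi * std_normal_cdf(x))


def smooth_test_statistic(x, p_hat, known_pdf, fitted_pdf, max_order=5):
    """Return (statistic, selected_order) for a penalized sum-of-squares comparison.

    The mixture density is taken to be (1 - p_hat) * known_pdf + p_hat * fitted_pdf.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Integration grid wide enough that both densities are negligible at the ends.
    grid = np.linspace(x.min() - 10.0, x.max() + 10.0, 20001)
    dx = grid[1] - grid[0]
    mixture = (1.0 - p_hat) * known_pdf(grid) + p_hat * fitted_pdf(grid)

    cumulative = []
    total = 0.0
    for k in range(1, max_order + 1):
        qk = basis(k, x)
        empirical_coef = qk.mean()                          # estimated from data
        model_coef = np.sum(basis(k, grid) * mixture) * dx  # plug-in, Riemann sum
        # Simplified standardization: ignores parameter-estimation variability.
        total += n * (empirical_coef - model_coef) ** 2 / qk.var(ddof=1)
        cumulative.append(total)

    cumulative = np.array(cumulative)
    # Schwarz-type penalty (an assumption): maximize T_k - k * log(n) over k.
    penalized = cumulative - np.arange(1, max_order + 1) * math.log(n)
    order = int(np.argmax(penalized)) + 1
    return float(cumulative[order - 1]), order
```

Calling `smooth_test_statistic(x, p_hat, known_pdf, fitted_pdf)` returns the statistic at the selected order together with that order. In this simplified setting, when the fitted mixture is correct the penalty tends to select order 1, which is what makes a \(\chi^2_1\) reference distribution plausible.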
MSC:
62G07 Density estimation
85A15 Galactic and stellar structure
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P35 Applications of statistics to physics
Software:
logcondens.mode
References:
[1] Allman, E. S., Matias, C. and Rhodes, J. A. (2009) Identifiability of parameters in latent structure models with many observed variables., Ann. Statist., 37, 3099-3132. · Zbl 1191.62003
[2] Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays., Proc. Natl Acad. Sci. USA, 96, 6745-6750.
[3] Al Mohamad, D. and Boumahdaf, A. (2018) Semiparametric two-component mixture models when one component is defined through linear constraints., IEEE Trans. Information Theory, 64, 795-830. · Zbl 1464.62245
[4] Arias-Castro, E. and Huang, R. (2018) The sparse variance contamination model., Preprint. arXiv:1807.10785v1.
[5] Balabdaoui, F. and Doss, C. R. (2018) Inference for a two-component mixture of symmetric distributions under log-concavity., Bernoulli, 24, 1053-1071. · Zbl 1419.62059
[6] Di Zio, M. and Guarnera, U. (2013) A contamination model for selective editing., J. Official Statist., 29, 539-555.
[7] Berrett, T. B., Samworth, R. J., and Yuan, M. (2019) Efficient multivariate entropy estimation via \(k\)-nearest neighbour distances., Ann. Statist. 47, 288-318. · Zbl 07036202
[8] Bordes, L., Delmas, C. and Vandekerkhove, P. (2006) Semiparametric estimation of a two-component mixture model when a component is known., Scand. J. Statist., 33, 733-752. · Zbl 1164.62331
[9] Bordes, L. and Vandekerkhove, P. (2010) Semiparametric two-component mixture model when a component is known: an asymptotically normal estimator., Math. Meth. Statist., 19, 22-41. · Zbl 1282.62068
[10] Dai, H. and Charnigo, R. (2010) Contaminated normal modeling with application to microarray data analysis., Can. J. Statist. 38, 315-332. · Zbl 1233.62024
[11] Doukhan, P., Pommeret, D. and Reboul, L. (2015) Data driven smooth test of comparison for dependent sequences., J. Multivar. Analys., 139, 147-165. · Zbl 1328.62263
[12] Gassiat, E. (2018) Mixtures of nonparametric components and hidden Markov models. Handbook of Mixture Analysis (ed. G. Celeux, S. Frühwirth-Schnatter, C. Robert, Chap. 12), To appear.
[13] Ghattas, B., Pommeret, D., Reboul, L. and Yao, A. F. (2011) Data driven smooth test for paired populations., J. Stat. Plan. Inference 141, 262-275. · Zbl 1197.62043
[14] Hedenfalk, I., et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344, 539-548.
[15] Inglot, T., Kallenberg, W. C. M. and Ledwina, T. (1997) Data driven smooth tests for composite hypotheses., Ann. Statist., 25, 1222-1250. · Zbl 0904.62055
[16] Klingenberg, C., Pirner, M. and Puppo, G. (2017) A consistent kinetic model for a two-component mixture with an application to plasma., Kinet. Relat. Models, 10, 445-465. · Zbl 1352.82009
[17] Ledwina, T. (1994) Data-driven version of Neyman’s smooth test of fit., J. Amer. Statist. Assoc. 89, 1000-1005. · Zbl 0805.62022
[18] Lindsay, B. G. (1983) The geometry of mixture likelihoods: a general theory., Ann. Statist., 11, 86-94. · Zbl 0512.62005
[19] Lindsay, B. G. (1989) Moment matrices: applications in mixtures., Ann. Statist., 17, 722-740. · Zbl 0672.62063
[20] Ma, Y. and Yao, W. (2015) Flexible estimation of a semiparametric two-component mixture model with one parametric component., Electr. J. Statist., 9, 444-474. · Zbl 1312.62044
[21] McLachlan, G. J., Bean, R. W. and Ben-Tovim Jones, L. (2006) A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays., Bioinformatics, 22, 1608-1615.
[22] Nguyen, V. H. and Matias, C. (2014) On efficient estimators of the proportion of true null hypotheses in a multiple testing setup., Scand. J. Statist., 41, 1167-1194. · Zbl 1305.62272
[23] Melchior, P. and Goulding, A. D. (2018) Filling the gaps: Gaussian mixture models from noisy, truncated or incomplete samples., Astronomy and Computing, 25, 183-194.
[24] Munk, A., Stockis, J. P., Valeinis, J. and Giese, G. (2010) Neyman smooth goodness-of-fit tests for the marginal distribution of dependent data., Ann. Instit. Statist. Math., 63, 939-959. · Zbl 1441.62224
[25] Neyman, J. (1937) Smooth test for goodness of fit., Skandinavisk Aktuarietidskrift, 20, 149-199. · JFM 63.1092.02
[26] Patra, R. K. and Sen, B. (2016) Estimation of a two-component mixture model with applications to multiple testing., J. Roy. Statist. Soc., Series B, 78, 869-893. · Zbl 1414.62111
[27] Podlaski, R. and Roesch, F. A. (2014) Modelling diameter distributions of two-cohort forest stands with various proportions of dominant species: A two-component mixture model approach., Math. Biosci., 249, 60-74. · Zbl 1309.92071
[28] Quandt, R. E. and Ramsey, J. B. (1978) Estimating mixtures of normal distributions and switching regressions (with comments)., J. Am. Statist. Ass., 73, 730-752. · Zbl 0401.62024
[29] Robin, A. C., Reylé, C., Derrière, S. and Picaud, S. (2003) A synthetic view on structure and evolution of the Milky Way., Astron. Astrophys., 409, 523-540.
[30] Shorack, G. R. and Wellner, J. A. (1986), Empirical Processes with Applications to Statistics. Wiley, New York. · Zbl 1170.62365
[31] Silverman, B. W. (1978) Weak and strong uniform consistency of the kernel estimate of a density and its derivatives., Ann. Statist., 6, 177-184. · Zbl 0376.62024
[32] Suesse, T., Rayner, J. C. W. and Thas, O. (2017) Assessing the fit of finite mixture distributions., Aust. N. Z. J. Stat., 59, 463-483. · Zbl 1384.62040
[33] Szegö, G. (1939), Orthogonal Polynomials. Colloquium Publications Volume XXIII. Amer. Math. Soc. · Zbl 0023.21505
[34] van der Vaart, A. W. (1998), Asymptotic Statistics. Cambridge University Press. · Zbl 0910.62001
[35] Walker, M. G., Mateo, M., Olszewski, E. W., Sen, B. and Woodroofe, M. (2009) Clean kinematic samples in dwarf spheroidals: an algorithm for evaluating membership and estimating distribution parameters when contamination is present., The Astronomical Journal, 137, 3109-3138.
[36] van’t Wout, A. B., et al. (2003) Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+-T-cell lines. J. Virol., 77, 1392-1402.
[37] Wylupek, G. (2010) Data driven K-sample tests., Technometrics, 52, 107-123.
[38] Xiang, S., Yao, W. and Yang, G. (2019) An overview of semiparametric extensions of finite mixture models., Statistical Science, 34, 391-404.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.