Hypothesis test for normal mixture models: the EM approach.

*(English)* Zbl 1173.62007

Summary: Normal mixture distributions are arguably the most important mixture models, and also the most technically challenging. The likelihood function of a normal mixture model based on a random sample is unbounded unless an artificial bound is placed on its component variance parameters. Moreover, the model is not strongly identifiable, so it is hard to differentiate between overdispersion caused by the presence of a mixture and that caused by a large variance, and it has infinite Fisher information with respect to the mixing proportions. There has been extensive research on finite normal mixture models, but much of it addresses merely the consistency of point estimation or useful practical procedures, and many results require undesirable restrictions on the parameter space.
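The unboundedness of the likelihood can be seen numerically. The following is a minimal sketch, not taken from the paper: the data values, the helper name `mixture_loglik`, and the fixed second component N(0, 1) are all assumptions chosen for illustration. One component mean is placed exactly on an observation and its standard deviation is shrunk toward zero, which drives the log-likelihood upward without limit.

```python
import math

def mixture_loglik(xs, alpha, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of a two-component normal mixture (hypothetical helper)."""
    def pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return sum(
        math.log(alpha * pdf(x, mu1, sigma1) + (1.0 - alpha) * pdf(x, mu2, sigma2))
        for x in xs
    )

# Hypothetical sample, for illustration only.
xs = [-1.2, -0.4, 0.1, 0.7, 1.5]

# Centre the first component on the first observation and shrink its
# standard deviation; the second component stays fixed at N(0, 1).
sigmas = (1.0, 1e-2, 1e-4, 1e-6)
logliks = [mixture_loglik(xs, 0.5, xs[0], s, 0.0, 1.0) for s in sigmas]

for s, ll in zip(sigmas, logliks):
    print(f"sigma1 = {s:g}: log-likelihood = {ll:.3f}")
```

Each further reduction of `sigma1` adds roughly `log(1/sigma1)` to the contribution of the observation sitting at `mu1`, while the other observations keep a positive density through the second component. This is why a bound on the variances, or a different approach, is needed.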

We show that an EM-test for homogeneity is effective at overcoming many challenges in the context of finite normal mixtures. We find that the limiting distribution of the EM-test is a simple function of the \(0.5\chi _{0}^{2}+0.5\chi _{1}^{2}\) and \(\chi _{1}^{2}\) distributions when the mixing variances are equal but unknown and the \(\chi _{2}^{2}\) when variances are unequal and unknown. Simulations show that the limiting distributions approximate the finite sample distribution satisfactorily. Two genetic examples are used to illustrate the application of the EM-test.
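The building-block distributions named above have closed-form tail probabilities, which a short sketch can evaluate. The summary does not state the exact "simple function" used in the equal-variance case, so the code below only computes the ingredients; the function names are hypothetical and this is not the authors' implementation. Here \(\chi_0^2\) is taken to be a point mass at zero.

```python
import math

def chi2_1_sf(x):
    """P(chi^2_1 > x); for x > 0 this equals erfc(sqrt(x / 2))."""
    return 1.0 if x <= 0 else math.erfc(math.sqrt(x / 2.0))

def chibar2_01_sf(x):
    """P(X > x) for X ~ 0.5*chi^2_0 + 0.5*chi^2_1 (chi^2_0 = point mass at 0)."""
    if x < 0:
        return 1.0
    # The chi^2_0 half never exceeds a non-negative threshold.
    return 0.5 * chi2_1_sf(x)

def chi2_2_sf(x):
    """P(chi^2_2 > x); chi^2_2 is exponential with mean 2."""
    return 1.0 if x <= 0 else math.exp(-x / 2.0)

# Hypothetical observed value of a test statistic, for illustration only.
stat = 4.2
print("chi^2_1 tail:             ", chi2_1_sf(stat))
print("0.5 chi^2_0 + 0.5 chi^2_1:", chibar2_01_sf(stat))
print("chi^2_2 tail:             ", chi2_2_sf(stat))
```

For the unequal-variance case, the stated \(\chi_2^2\) limit means `chi2_2_sf` applied to the observed statistic gives an approximate p-value directly.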

##### MSC:

| MSC code | Classification |
|---|---|
| 62F03 | Parametric hypothesis testing |
| 65C60 | Computational problems in statistics (MSC2010) |
| 62E20 | Asymptotic distribution theory in statistics |
| 62F05 | Asymptotic properties of parametric tests |
| 62P10 | Applications of statistics to biology and medical sciences; meta analysis |

##### Keywords:

chi-square limiting distribution; compactness; normal mixture models; homogeneity test; likelihood ratio test; statistical genetics