Nguyen, TrungTin; Nguyen, Hien Duy; Chamroukhi, Faicel; Forbes, Florence
A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. (English) Zbl 07603097
Electron. J. Stat. 16, No. 2, 4742-4822 (2022).

Summary: Mixture of experts (MoE) models are a popular class of statistical and machine learning models that have gained attention over the years for their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to represent nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models raise difficult statistical estimation and model selection questions, from both a computational and a theoretical perspective. This paper studies model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of the Gaussian mean experts, and the hidden block-diagonal structure of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds in the form of weak oracle inequalities, provided that lower bounds on the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.

Cited in 1 Document

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62E17 Approximations to statistical distributions (nonasymptotic)
62H12 Estimation in multivariate analysis

Keywords: mixture of experts; linear cluster-weighted models; mixture of regressions; Gaussian locally-linear mapping models; clustering; oracle inequality; model selection; penalized maximum likelihood; block-diagonal covariance matrix; graphical Lasso

Software: PRMLT; mixOmics; R; CAPUSHE

Full Text: DOI arXiv Link
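To make the setting of the summary concrete, the conditional density of a GLoME model with $K$ components can be sketched as follows; the symbols below ($\pi_k$, $c_k$, $\Gamma_k$ for the Gaussian gating parameters, $\upsilon_k$ for the mean experts) are illustrative shorthand in the spirit of this literature, not necessarily the paper's exact notation:
\[
  s_{\psi}(y \mid x) \;=\; \sum_{k=1}^{K}
  \frac{\pi_k\, \phi(x;\, c_k, \Gamma_k)}{\sum_{l=1}^{K} \pi_l\, \phi(x;\, c_l, \Gamma_l)}\;
  \phi\bigl(y;\, \upsilon_k(x), \Sigma_k\bigr),
\]
where $\phi(\cdot\,; \mu, \Sigma)$ denotes a Gaussian density, the mean experts $\upsilon_k$ range over a function class whose complexity is part of the model index, and the expert covariances $\Sigma_k$ are unrestricted in GLoME and block-diagonal in BLoME. A model $m$ thus specifies the number of components, the expert complexity, and the block structure, and the selected model minimizes a penalized log-likelihood criterion of the form
\[
  \widehat{m} \in \arg\min_{m \in \mathcal{M}}
  \Bigl\{ -\frac{1}{n} \sum_{i=1}^{n} \ln s_{\widehat{\psi}_m}(y_i \mid x_i) + \mathrm{pen}(m) \Bigr\}.
\]
As is typical of weak oracle inequalities in this penalized-likelihood framework, the risk bound is guaranteed once $\mathrm{pen}(m)$ exceeds a multiple of the model dimension divided by $n$, up to logarithmic factors; the paper itself should be consulted for the precise penalty lower bounds and constants.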
References:
[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 716-723. · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[2] Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. Springer-Verlag, New York. · Zbl 1005.62007 · doi:10.1007/b97636
[3] Arlot, S. (2019). Minimal penalties and the slope heuristics: a survey. Journal de la Société Française de Statistique 160 1-106. · Zbl 1437.62121
[4] Arlot, S. and Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. In Advances in Neural Information Processing Systems 22.
[5] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. Journal of Machine Learning Research 10.
[6] Barron, A. R., Huang, C., Li, J. and Luo, X. (2008). The MDL principle, penalized likelihoods, and statistical risk. In Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday 33-63.
[7] Baudry, J.-P. (2009). Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris-Sud XI.
[8] Baudry, J.-P., Maugis, C. and Michel, B. (2012). Slope heuristics: overview and implementation. Statistics and Computing 22 455-470. · Zbl 1322.62007
[9] Bertin, K., Le Pennec, E. and Rivoirard, V. (2011). Adaptive Dantzig density estimation. Annales de l'I.H.P. Probabilités et Statistiques 47 43-74. · Zbl 1207.62077 · doi:10.1214/09-AIHP351
[10] Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 719-725.
[11] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields 138 33-73. · Zbl 1112.62082
[12] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli 4 329-375. · Zbl 0954.62033
[13] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. · Zbl 1107.68072
[14] Borwein, J. M. and Zhu, Q. J. (2004). Techniques of Variational Analysis. Springer.
[15] Bouchard, G. (2003). Localised mixtures of experts for mixture of regressions. In Between Data Science and Applied Data Analysis (M. Schader, W. Gaul and M. Vichi, eds.) 155-164. Springer, Berlin, Heidelberg. · Zbl 05280169
[16] Brinkman, N. D. (1981). Ethanol fuel—single-cylinder engine study of efficiency and exhaust emissions. SAE Transactions 1410-1424.
[17] Bunea, F., Tsybakov, A. B., Wegkamp, M. H. and Barbu, A. (2010). SPADES and mixture models. The Annals of Statistics 38 2525-2558. · Zbl 1198.62025 · doi:10.1214/09-AOS790
[18] Butucea, C., Delmas, J.-F., Dutfoy, A. and Fischer, R. (2017). Optimal exponential bounds for aggregation of estimators for the Kullback-Leibler loss. Electronic Journal of Statistics 11 2258-2294. · Zbl 1364.62082 · doi:10.1214/17-EJS1269
[19] Cambanis, S., Huang, S. and Simons, G. (1981). On the theory of elliptically contoured distributions. Journal of Multivariate Analysis 11 368-385. · Zbl 0469.60019
[20] Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition 28 781-793.
[21] Chamroukhi, F., Samé, A., Govaert, G. and Aknin, P. (2010). A hidden process regression model for functional data description. Application to curve discrimination. Neurocomputing 73 1210-1221.
[22] Cohen, S. and Le Pennec, E. (2011). Conditional density estimation by penalized likelihood model selection and applications. Technical report, INRIA.
[23] Cohen, S. X. and Le Pennec, E. (2013). Partition-based conditional density estimation. ESAIM: Probability and Statistics 17 672-697. · Zbl 1284.62250
[24] Dalalyan, A. S. and Sebbar, M. (2018). Optimal Kullback-Leibler aggregation in mixture density estimation by maximum likelihood. Mathematical Statistics and Learning 1 1-35. · Zbl 1416.62193
[25] Deleforge, A., Forbes, F., Ba, S. and Horaud, R. (2015). Hyper-spectral image analysis with partially latent regression and spatial Markov dependencies. IEEE Journal of Selected Topics in Signal Processing 9 1037-1048. · doi:10.1109/JSTSP.2015.2416677
[26] Deleforge, A., Forbes, F. and Horaud, R. (2015). High-dimensional regression with Gaussian mixtures and partially-latent response variables. Statistics and Computing 25 893-911. · Zbl 1332.62192 · doi:10.1007/s11222-014-9461-5
[27] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39 1-38. · Zbl 0364.62022
[28] Devijver, E. (2015). An $\ell_1$-oracle inequality for the Lasso in multivariate finite mixture of multivariate Gaussian regression models. ESAIM: Probability and Statistics 19 649-670. · Zbl 1392.62179 · doi:10.1051/ps/2015011
[29] Devijver, E. (2015). Finite mixture regression: a sparse variable selection by model selection for clustering. Electronic Journal of Statistics 9 2642-2674. · Zbl 1329.62279
[30] Devijver, E. (2015). An $\ell_1$-oracle inequality for the Lasso in finite mixture of multivariate Gaussian regression. ESAIM: Probability and Statistics 19 649-670. · Zbl 1392.62179
[31] Devijver, E. (2017). Model-based regression clustering for high-dimensional data: application to functional data. Advances in Data Analysis and Classification 11 243-279. · Zbl 1414.62238
[32] Devijver, E. (2017). Joint rank and variable selection for parsimonious estimation in a high-dimensional finite mixture regression model. Journal of Multivariate Analysis 157 1-13. · Zbl 1362.62127
[33] Devijver, E. and Gallopin, M. (2018). Block-diagonal covariance selection for high-dimensional Gaussian graphical models. Journal of the American Statistical Association 113 306-314. · Zbl 1398.62020
[34] Devijver, E., Gallopin, M. and Perthame, E. (2017). Nonlinear network-based quantitative trait prediction from transcriptomic data. arXiv preprint arXiv:1701.07899.
[35] Devijver, E. and Perthame, E. (2020). Prediction regions through inverse regression. Journal of Machine Learning Research 21 1-24. · Zbl 1508.62093
[36] Ding, P. (2016). On the conditional distribution of the multivariate $t$ distribution. The American Statistician 70 293-295. · Zbl 07665887 · doi:10.1080/00031305.2016.1164756
[37] Duistermaat, J. J. and Kolk, J. A. (2004). Multidimensional Real Analysis I: Differentiation 86. Cambridge University Press. · Zbl 1077.26001
[38] Fang, K. T., Kotz, S. and Ng, K. W. (1990). Symmetric Multivariate and Related Distributions. Chapman and Hall. · Zbl 0699.62048
[39] Forbes, F., Nguyen, H. D., Nguyen, T. T. and Arbel, J. (2021). Approximate Bayesian computation with surrogate posteriors. Preprint hal-03139256.
[40] Frahm, G. (2004). Generalized Elliptical Distributions: Theory and Applications. PhD thesis, Universität zu Köln.
[41] Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. The Annals of Statistics 28 1105-1127. · Zbl 1105.62333 · doi:10.1214/aos/1015956709
[42] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 531-537. · doi:10.1126/science.286.5439.531
[43] Ho, N. and Nguyen, X. (2016). Convergence rates of parameter estimation for some weakly identifiable finite mixtures. The Annals of Statistics 44 2726-2755. · Zbl 1359.62076
[44] Ho, N. and Nguyen, X. (2016). On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics 10 271-307. · Zbl 1332.62095
[45] Ho, N., Yang, C.-Y. and Jordan, M. I. (2019). Convergence rates for Gaussian mixtures of experts. arXiv preprint arXiv:1907.04377.
[46] Hult, H. and Lindskog, F. (2002). Multivariate extremes, aggregation and dependence in elliptical distributions. Advances in Applied Probability 34 587-608. · Zbl 1023.60021 · doi:10.1239/aap/1033662167
[47] Ingrassia, S., Minotti, S. C. and Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approach with elliptical distributions. Journal of Classification 29 363-401. · Zbl 1360.62335 · doi:10.1007/s00357-012-9114-3
[48] Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation 3 79-87.
[49] Jiang, W. and Tanner, M. A. (1999). Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. The Annals of Statistics 27 987-1011. · Zbl 0957.62032
[50] Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 181-214.
[51] Kelker, D. (1970). Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā: The Indian Journal of Statistics, Series A 32 419-430. · Zbl 0223.60008
[52] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1312.6114.
[53] Kosorok, M. R. (2007). Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.
[54] Kotz, S. and Nadarajah, S. (2004). Multivariate t-Distributions and Their Applications. Cambridge University Press, Cambridge. · Zbl 1100.62059 · doi:10.1017/CBO9780511550683
[55] Lathuilière, S., Juge, R., Mesejo, P., Muñoz-Salinas, R. and Horaud, R. (2017). Deep mixture of linear inverse regressions applied to head-pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4817-4825.
[56] Lê Cao, K.-A., Rossouw, D., Robert-Granié, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology 7, Article 35. · Zbl 1276.62061 · doi:10.2202/1544-6115.1390
[57] Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86 316-327. · Zbl 0742.62044
[58] Mallows, C. L. (1973). Some comments on $C_p$. Technometrics 15 661-675. · Zbl 0269.62061
[59] Masoudnia, S. and Ebrahimpour, R. (2014). Mixture of experts: a literature survey. Artificial Intelligence Review 42 275-293. · doi:10.1007/s10462-012-9338-y
[60] Massart, P. (2007). Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer. · Zbl 1170.60006
[61] Massart, P. and Meynet, C. (2011). The Lasso as an $\ell_1$-ball model selection procedure. Electronic Journal of Statistics 5 669-687. · Zbl 1274.62468
[62] Maugis, C. and Michel, B. (2011). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics 15 41-68. · Zbl 1395.62162
[63] Maugis, C. and Michel, B. (2011). Data-driven penalty calibration: a case study for Gaussian mixture model selection. ESAIM: Probability and Statistics 15 320-339. · Zbl 1395.62163
[64] Maugis-Rabusseau, C. and Michel, B. (2013). Adaptive density estimation for clustering with Gaussian mixtures. ESAIM: Probability and Statistics 17 698-724. · Zbl 1395.62164
[65] McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley. · Zbl 0882.62012
[66] McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons. · Zbl 0963.62061
[67] Mendes, E. F. and Jiang, W. (2012). On convergence rates of mixtures of polynomial experts. Neural Computation 24 3025-3051. · Zbl 1268.62039
[68] Meynet, C. (2013). An $\ell_1$-oracle inequality for the Lasso in finite mixture Gaussian regression models. ESAIM: Probability and Statistics 17 650-671. · Zbl 1395.62166
[69] Moerland, P. (1999). Classification using localized mixture of experts. In Ninth International Conference on Artificial Neural Networks 2 838-843. · doi:10.1049/cp:19991216
[70] Montuelle, L. and Le Pennec, E. (2014). Mixture of Gaussian regressions model with logistic weights, a penalized maximum likelihood approach. Electronic Journal of Statistics 8 1661-1695. · Zbl 1297.62091
[71] Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18 39-50. · doi:10.1093/bioinformatics/18.1.39
[72] Nguyen, H. D. and Chamroukhi, F. (2018). Practical and theoretical aspects of mixture-of-experts modeling: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 e1246.
[73] Nguyen, H. D., Chamroukhi, F. and Forbes, F. (2019). Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing 366 208-214. · doi:10.1016/j.neucom.2019.08.014
[74] Nguyen, H. D., Lloyd-Jones, L. R. and McLachlan, G. J. (2016). A universal approximation theorem for mixture-of-experts models. Neural Computation 28 2585-2593. · Zbl 1474.68266
[75] Nguyen, H. D., Nguyen, T., Chamroukhi, F. and McLachlan, G. J. (2021). Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models. Journal of Statistical Distributions and Applications 8 13. · Zbl 1490.62162 · doi:10.1186/s40488-021-00125-0
[76] Nguyen, T. (2021). Model Selection and Approximation in High-dimensional Mixtures of Experts Models: from Theory to Practice. PhD thesis, Normandie Université.
[77] Nguyen, T., Chamroukhi, F., Nguyen, H. and McLachlan, G. (2021). Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. Communications in Statistics - Theory and Methods. · Zbl 07710581 · doi:10.1080/03610926.2021.2002360
[78] Nguyen, T., Nguyen, H. D., Chamroukhi, F. and McLachlan, G. J. (2020). Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics 7 1750861. · Zbl 1486.62048
[79] Nguyen, T., Nguyen, H. D., Chamroukhi, F. and McLachlan, G. J. (2020). An $\ell_1$-oracle inequality for the Lasso in mixture-of-experts regression models. arXiv preprint arXiv:2009.10622.
[80] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics 41 370-400. · Zbl 1347.62117
[81] Norets, A. (2010). Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics 38 1733-1766. · Zbl 1189.62060
[82] Norets, A. and Pati, D. (2017). Adaptive Bayesian estimation of conditional densities. Econometric Theory 33 980-1012. · Zbl 1441.62095 · doi:10.1017/S0266466616000220
[83] Norets, A. and Pelenis, J. (2014). Posterior consistency in conditional density estimation by covariate dependent mixtures. Econometric Theory 30 606-646. · Zbl 1296.62083 · doi:10.1017/S026646661300042X
[84] Perthame, E., Forbes, F. and Deleforge, A. (2018). Inverse regression approach to robust nonlinear high-to-low dimensional mapping. Journal of Multivariate Analysis 163 1-14. · Zbl 1408.62119 · doi:10.1016/j.jmva.2017.09.009
[85] Rakhlin, A., Panchenko, D. and Mukherjee, S. (2005). Risk bounds for mixture density estimation. ESAIM: Probability and Statistics 9 220-229. · Zbl 1141.62024 · doi:10.1051/ps:2005011
[86] Ramamurti, V. and Ghosh, J. (1996). Structural adaptation in mixture of experts. In Proceedings of the 13th International Conference on Pattern Recognition, Vol. 4, 704-708. · doi:10.1109/ICPR.1996.547656
[87] Ramamurti, V. and Ghosh, J. (1998). Use of localized gating in mixture of experts networks. In Proc. SPIE 3390. · doi:10.1117/12.304812
[88] Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26 195-239. · Zbl 0536.62021
[89] Rigollet, P. (2012). Kullback-Leibler aggregation and misspecified generalized linear models. The Annals of Statistics 40 639-665. · Zbl 1274.62298 · doi:10.1214/11-AOS961
[90] Sato, M. and Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation 12 407-432. · Zbl 1473.68164 · doi:10.1162/089976600300015853
[91] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 461-464. · Zbl 0379.62005
[92] Städler, N., Bühlmann, P. and van de Geer, S. (2010). $\ell_1$-penalization for mixture regression models. TEST 19 209-256. · Zbl 1203.62128
[93] R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[94] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58 267-288. · Zbl 0850.62538
[95] Tu, C.-C., Forbes, F., Lemasson, B. and Wang, N. (2019). Prediction with high dimensional regression via hierarchically structured Gaussian mixtures and latent variables. Journal of the Royal Statistical Society, Series C (Applied Statistics) 68 1485-1507. · doi:10.1111/rssc.12370
[96] van de Geer, S. (2000). Empirical Processes in M-Estimation 6. Cambridge University Press. · Zbl 0953.62049
[97] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer. · Zbl 0862.60002
[98] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1-25. · Zbl 0478.62088 · doi:10.2307/1912526
[99] Xu, L., Jordan, M. I. and Hinton, G. E. (1995). An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems 7 (G. Tesauro, D. Touretzky and T. Leen, eds.). MIT Press.
[100] Young, D. S. (2014). Mixtures of regressions with changepoints. Statistics and Computing 24 265-281. · Zbl 1325.62128
[101] Yuksel, S. E., Wilson, J. N. and Gader, P. D. (2012). Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23 1177-1193. · doi:10.1109/TNNLS.2012.2200299

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.