Multi-species distribution modeling using penalized mixture of regressions. (English) Zbl 1397.62263

Summary: Multi-species distribution modeling, which relates the occurrence of multiple species to environmental variables, is an important tool used by ecologists for both predicting the distribution of species in a community and identifying the important variables driving species co-occurrences. In [“Model based grouping of species across environmental gradients”, Ecol. Model. 222, No. 4, 955–963 (2011; doi:10.1016/j.ecolmodel.2010.11.030)], P. K. Dunstan et al. proposed using finite mixture of regression (FMR) models for multi-species distribution modeling, where species are clustered based on their environmental response to form a small number of “archetypal responses.” As an illustrative example, they applied their mixture model approach to a presence-absence data set of 200 marine organisms, collected along the Great Barrier Reef in Australia. Little attention, however, was given to the problem of model selection. Since the archetypes (mixture components) may depend on different but likely overlapping sets of covariates, a method is needed for performing variable selection on all components simultaneously. In this article, we consider using penalized likelihood functions for variable selection in FMR models. We propose two penalties that exploit the grouped structure of the covariates, that is, each covariate is represented by a group of coefficients, one for each component. This leads to an attractive form of shrinkage that allows a covariate to be removed from all components simultaneously. Both penalties are shown to possess specific forms of variable selection consistency, with simulations indicating that they outperform other methods that do not take the grouped structure into account.
When applied to the Great Barrier Reef data set, penalized FMR models offer more insight into the important variables driving species co-occurrence in the marine community (compared to previous results where no model selection was conducted), while offering a computationally stable method of modeling complex species-environment relationships (through regularization).
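
The grouped shrinkage described in the summary can be illustrated with a small sketch. It is not the authors' method: it uses the proximal (soft-thresholding) step of a plain group lasso penalty in the sense of Yuan and Lin (2006), applied to a made-up coefficient matrix. The function name `group_soft_threshold`, the tuning value `lam`, and the numbers in `beta` are all assumptions chosen for illustration. Each column holds one covariate's coefficients across the mixture components, so shrinking a column's norm to zero drops that covariate from every component at once.

```python
import numpy as np

def group_soft_threshold(beta, lam):
    """Proximal operator of lam * sum_j ||beta[:, j]||_2 (generic group lasso).

    beta has shape (n_components, n_covariates); column j collects the
    coefficients of covariate j in every mixture component (hypothetical layout).
    """
    norms = np.linalg.norm(beta, axis=0)
    # Scale each column by max(0, 1 - lam / norm); columns with norm <= lam vanish.
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return beta * scale

# Illustrative coefficients: 3 components (rows) by 3 covariates (columns).
beta = np.array([[1.5, 0.10, -2.0],
                 [-0.8, 0.05, 0.0],
                 [2.0, -0.10, 1.0]])

shrunk = group_soft_threshold(beta, lam=0.5)

# Covariates whose coefficient group survives the shrinkage step.
selected = np.flatnonzero(np.linalg.norm(shrunk, axis=0) > 0)
print(shrunk)
print("covariates kept:", selected)
```

In this toy input the second covariate has a small coefficient in every component, so its whole column is set to zero, while the other two columns are shrunk but retained. A penalty applied to each coefficient separately would instead decide component by component.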


62J12 Generalized linear models (logistic models)
62P12 Applications of statistics to environmental and related topics


[1] Clark, J. S. (2010). Individuals and the variation needed for high species diversity in forest trees. Science 327 1129-1132.
[2] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1-38. · Zbl 0364.62022
[3] Dunstan, P. K., Foster, S. D. and Darnell, R. (2011). Model based grouping of species across environmental gradients. Ecol. Model. 222 955-963.
[4] Dunstan, P. K., Foster, S. D., Hui, F. K. C. and Warton, D. I. (2013). Finite mixture of regression modeling for high-dimensional count and biomass data in ecology. J. Agric. Biol. Environ. Stat. 18 357-375. · Zbl 1303.62066
[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547
[6] Ferrier, S. and Guisan, A. (2006). Spatial modelling of biodiversity at the community level. J. Appl. Ecol. 43 393-404.
[7] Fithian, W. and Hastie, T. (2013). Finite-sample equivalence in statistical models for presence-only data. Ann. Appl. Stat. 7 1917-1939. · Zbl 1283.62225
[8] Follmann, D. A. and Lambert, D. (1991). Identifiability of finite mixtures of logistic regression models. J. Statist. Plann. Inference 27 375-381. · Zbl 0717.62061
[9] Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New York. · Zbl 1108.62002
[10] Grün, B. and Leisch, F. (2008). Identifiability of finite mixtures of multinomial logit models with varying and fixed effects. J. Classification 25 225-247. · Zbl 1276.62021
[11] Hennig, C. (2000). Identifiability of models for clusterwise linear regression. J. Classification 17 273-296. · Zbl 1017.62058
[12] Hui, F. K. C., Warton, D. I., Foster, S. D. and Dunstan, P. K. (2013). To mix or not to mix: Comparing the predictive performance of mixture models versus separate species distribution models. Ecology 94 1913-1919.
[13] Hui, F. K. C., Warton, D. I. and Foster, S. D. (2015a). Tuning parameter selection for the adaptive lasso using ERIC. J. Amer. Statist. Assoc. 110 262-269. · Zbl 1373.62370
[14] Hui, F. K. C., Warton, D. I. and Foster, S. D. (2015b). Supplement to “Multi-species distribution modeling using penalized mixture of regressions.”
[15] Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. J. Amer. Statist. Assoc. 102 1025-1038. · Zbl 1469.62306
[16] Khalili, A. and Lin, S. (2013). Regularization in finite mixture of regression models with diverging number of parameters. Biometrics 69 436-446. · Zbl 1273.62254
[17] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, London. · Zbl 0744.62098
[18] McLachlan, G. and Peel, D. (2004). Finite Mixture Models. Wiley, New York. · Zbl 0963.62061
[19] Ovaskainen, O., Hottola, J. and Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology 91 2514-2521.
[20] Ovaskainen, O. and Soininen, J. (2011). Making more out of sparse data: Hierarchical modeling of species communities. Ecology 92 289-295.
[21] Pitcher, R. C., Doherty, P. P., Arnold, P. P., Hooper, J. J., Gribble, N. N. et al. (2007). Seabed Biodiversity on the Continental Shelf of the Great Barrier Reef World Heritage Area. CSIRO Marine and Atmospheric Research, Queensland, Australia.
[22] Pollock, L. J., Tingley, R., Morris, W. K., Golding, N., O’Hara, R. B., Parris, K. M., Vesk, P. A. and McCarthy, M. A. (2014). Understanding co-occurrence by modelling species simultaneously with a joint species distribution model (JSDM). Methods in Ecology and Evolution 5 397-406.
[23] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005
[24] Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2013). A sparse-group lasso. J. Comput. Graph. Statist. 22 231-245.
[25] Städler, N., Bühlmann, P. and van de Geer, S. (2010). \(\ell_{1}\)-penalization for mixture regression models. TEST 19 209-256. · Zbl 1203.62128
[26] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[27] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91-108. · Zbl 1060.62049
[28] Warton, D. I. and Shepherd, L. C. (2010). Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology. Ann. Appl. Stat. 4 1383-1402. · Zbl 1202.62171
[29] Wedel, M. and DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. J. Classification 12 21-55. · Zbl 0825.62611
[30] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49-67. · Zbl 1141.62030
[31] Zhang, Y., Li, R. and Tsai, C.-L. (2010). Regularization parameter selections via generalized information criterion. J. Amer. Statist. Assoc. 105 312-323. · Zbl 1397.62262
[32] Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468-3497. · Zbl 1369.62164
[33] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326
[34] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301-320. · Zbl 1069.62054