Why you cannot transform your way out of trouble for small counts. (English) Zbl 1415.62153

Summary: While data transformation is a common strategy to satisfy linear modeling assumptions, a theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts. Under broad assumptions, as counts get smaller, it is shown that the variance becomes proportional to the mean under monotonic transformations \(g(\cdot)\) that satisfy \(g(0)=0\), excepting a few pathological cases. A suggested rule-of-thumb is that if many predicted counts are less than one then data transformation cannot reasonably be expected to stabilize variances, even for a well-chosen transformation. This result has clear implications for the analysis of counts as often implemented in the applied sciences, but particularly for multivariate analysis in ecology. Multivariate discrete data are often collected in ecology, typically with a large proportion of zeros, and it is currently widespread to use methods of analysis that do not account for differences in variance across observations nor across responses. Simulations demonstrate that failure to account for the mean-variance relationship can have particularly severe consequences in this context, and also in the univariate context if the sampling design is unbalanced.
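The small-count behaviour described in the summary can be checked numerically in one simple case. The sketch below is an illustration only, not the paper's general argument. It assumes Poisson counts \(Y\) with mean \(\mu\) and two transformations with \(g(0)=0\) that are not named in the summary, \(g(y)=\sqrt{y}\) and \(g(y)=\log(y+1)\). For small \(\mu\), \(P(Y=1)\approx\mu\) and \(P(Y\ge 2)=O(\mu^2)\), so \(\operatorname{Var}\{g(Y)\}\approx g(1)^2\mu\): the variance stays proportional to the mean. The helper names `poisson_pmf` and `transformed_variance` are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability P(Y = k) for mean mu > 0, computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def transformed_variance(g, mu):
    """Var{g(Y)} for Y ~ Poisson(mu), by summing over a truncated support."""
    # Assumption: truncating about 20 standard deviations above the mean
    # leaves a negligible tail for the means used here.
    kmax = int(mu + 20 * math.sqrt(mu) + 50)
    m1 = 0.0  # running sum for E[g(Y)]
    m2 = 0.0  # running sum for E[g(Y)^2]
    for k in range(kmax + 1):
        p = poisson_pmf(k, mu)
        gk = g(k)
        m1 += gk * p
        m2 += gk * gk * p
    return m2 - m1 * m1


# Two monotonic transformations with g(0) = 0 (illustrative choices).
transforms = {"sqrt": math.sqrt, "log1p": math.log1p}

for name, g in transforms.items():
    for mu in (100.0, 10.0, 1.0, 0.1, 0.01):
        v = transformed_variance(g, mu)
        # If the variance were stabilized, v would be roughly constant in mu.
        # For small mu, v / mu approaches g(1)^2 instead.
        print(f"{name:6s} mu={mu:7.2f}  var={v:.5f}  var/mu={v / mu:.5f}")
```

Under these assumptions the square-root transformation gives a variance near 1/4 when the mean is large, which is the familiar variance-stabilizing behaviour, but for means well below one the ratio var/mu settles near \(g(1)^2\) for both transformations, so the transformed variance shrinks with the mean.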

MSC:

62P12 Applications of statistics to environmental and related topics
62J12 Generalized linear models (logistic models)
62H12 Estimation in multivariate analysis

Software:

mvabund; vegan
