
Fast automatic smoothing for generalized additive models. (English) Zbl 1440.68208

Summary: Generalized additive models (GAMs) are regression models wherein parameters of probability distributions depend on input variables through a sum of smooth functions, whose degrees of smoothness are selected by \(L_2\) regularization. Such models have become the de facto standard nonlinear regression models when interpretability and flexibility are required, but reliable and fast methods for automatic smoothing in large data sets are still lacking. We develop a general methodology for automatically learning the optimal degree of \(L_2\) regularization for GAMs using an empirical Bayes approach. The smooth functions are penalized by hyper-parameters that are learned simultaneously by maximization of a marginal likelihood using an approximate expectation-maximization algorithm. The latter involves a double Laplace approximation at the E-step and leads to an efficient M-step. Empirical analysis shows that the resulting algorithm is numerically stable, faster than the best existing methods, and achieves state-of-the-art accuracy. For illustration, we apply it to an important and challenging problem in the analysis of extremal data.
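
The abstract compresses the method into a few sentences, so a minimal sketch may help fix ideas. The Python code below is not the authors' implementation (which handles general response distributions via the double Laplace approximation described above); it illustrates the Gaussian special case, where the E-step posterior is exactly Gaussian and the EM updates for the smoothing parameter \(\lambda\) and the noise variance \(\sigma^2\) have closed forms. The truncated-line basis, knot placement, and tolerance are illustrative assumptions.

```python
# Hypothetical sketch: empirical-Bayes smoothing via EM for a Gaussian
# penalized-spline model with one smooth term. For a Gaussian likelihood
# the E-step posterior is exact, so no Laplace approximation is needed.
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a smooth signal plus Gaussian noise.
n = 400
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(n)

# Truncated-line spline basis: unpenalized part [1, x] plus K knot terms.
K = 30
knots = np.quantile(x, np.linspace(0.0, 1.0, K + 2)[1:-1])
X = np.column_stack([np.ones(n), x, np.maximum(x[:, None] - knots, 0.0)])
p = X.shape[1]

# The penalty S acts only on the knot coefficients, i.e. the prior is
# beta_knots ~ N(0, (lam * I)^(-1)) with a flat prior on [1, x].
S = np.zeros((p, p))
S[2:, 2:] = np.eye(K)
rank_S = K

lam, sigma2 = 1.0, 1.0
XtX, Xty = X.T @ X, X.T @ y

for _ in range(200):
    # E-step: Gaussian posterior of the coefficients given (lam, sigma2).
    Sigma = np.linalg.inv(XtX / sigma2 + lam * S)
    mu = Sigma @ Xty / sigma2

    # M-step: closed-form maximizers of the expected complete-data
    # log-likelihood, using E[b' S b] = mu' S mu + tr(S Sigma).
    lam_new = rank_S / (mu @ S @ mu + np.trace(S @ Sigma))
    resid = y - X @ mu
    sigma2 = (resid @ resid + np.trace(XtX @ Sigma)) / n

    converged = abs(np.log(lam_new / lam)) < 1e-8
    lam = lam_new
    if converged:
        break

# Effective degrees of freedom = trace of the smoother ("hat") matrix.
Sigma = np.linalg.inv(XtX / sigma2 + lam * S)
edf = np.trace(Sigma @ XtX) / sigma2
print(f"lambda = {lam:.3g}, sigma^2 = {sigma2:.3g}, EDF = {edf:.1f}")
```

The update \(\lambda \leftarrow \operatorname{rank}(S)/(\mu^\top S \mu + \operatorname{tr}(S\Sigma))\) is the classical EM update for a Gaussian variance component; the paper's contribution is making the analogous E-step tractable for non-Gaussian likelihoods via the double Laplace approximation, which keeps the M-step in this simple closed form.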

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62J12 Generalized linear models (logistic models)
Full Text: arXiv Link

References:

[1] A. Ba, M. Sinn, Y. Goude, and P. Pompey. Adaptive learning of smoothing functions: application to electricity load forecasting. In Advances in Neural Information Processing Systems 25, pages 2510-2518, USA, 2012.
[2] D. Bates and D. Eddelbuettel. Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5):1-24, 2013.
[3] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580-598, 1985. · Zbl 0594.62044
[4] P.-C. Bürkner. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1):1-28, 2017.
[5] B. Carpenter, D. Lee, M. A. Brubaker, A. Riddell, A. Gelman, B. Goodrich, J. Guo, M. Hoffman, M. Betancourt, and P. Li. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1):1-32, 2017.
[6] V. Chavez-Demoulin and A. C. Davison. Generalized additive modelling of sample extremes. Journal of the Royal Statistical Society, Series C, 54(1):207-222, 2005. · Zbl 1490.62194
[7] V. Chavez-Demoulin and A. C. Davison. Modelling time series extremes. REVSTAT Statistical Journal, 10:109-133, 2012. · Zbl 1297.62189
[8] W. S. Cleveland, E. Grosse, and W. M. Shyu. Local Regression Models. Chapman & Hall, New York, 1993.
[9] T. J. Cole and P. J. Green. Smoothing reference centile curves: the LMS method and penalized likelihood. Statistics in Medicine, 11(10):1305-1319, 1992.
[10] A. C. Davison and N. I. Ramesh. Local likelihood smoothing of sample extremes. Journal of the Royal Statistical Society, Series B, 62:191-208, 2000. · Zbl 0942.62058
[11] L. de Haan and A. Ferreira. Extreme Value Theory. Springer-Verlag, New York, 2006. · Zbl 1101.62002
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977. · Zbl 0364.62022
[13] D. K. Duvenaud, H. Nickisch, and C. E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226-234, USA, 2011.
[14] A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14, pages 383-389, Cambridge, USA, 2001.
[15] R. A. Fisher and L. H. C. Tippett. Limiting forms of the frequency distributions of the largest or smallest member of a sample. Proceedings of the Cambridge Philosophical Society, 24:180-190, 1928. · JFM 54.0560.05
[16] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, 4th edition, 2013. · Zbl 1268.65037
[17] C. Gu. Cross-validating non-Gaussian data. Journal of Computational and Graphical Statistics, 1(2):169-179, 1992.
[18] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2018.
[19] T. J. Hastie and R. J. Tibshirani. Generalized additive models (with discussion). Statistical Science, 1:297-310, 1986. · Zbl 0955.62603
[20] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990. · Zbl 0747.62061
[21] T. J. Hastie, R. J. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2009. · Zbl 1273.62005
[22] A. F. Jenkinson. The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Quarterly Journal of the Royal Meteorological Society, 81:158-171, 1955.
[23] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495-502, 1970. · Zbl 0193.45201
[24] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[25] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5):1035-1068, 1999.
[26] A. McHutchon and C. E. Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems 24, pages 1341-1349, USA, 2011.
[27] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions (Wiley Series in Probability and Statistics). Wiley-Interscience, 2nd edition, 2008. · Zbl 1165.62019
[28] M. Mutný and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems 31, pages 9005-9016, USA, 2018.
[29] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996. · Zbl 0888.62021
[30] J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3):370-384, 1972.
[31] D. Nychka. Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83(404):1134-1143, 1988.
[32] D. Oakes. Direct calculation of the information matrix via the EM. Journal of the Royal Statistical Society, Series B, 61(2):479-482, 1999. · Zbl 0913.62036
[33] F. O’Sullivan, B. S. Yandell, and W. J. Raynor. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association, 81(393):96-103, 1986.
[34] Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In International Conference on Machine Learning, page 85, New York, USA, 2004.
[35] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.
[36] P. T. Reiss and R. T. Ogden. Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society, Series B, 71(2):505-523, 2009. · Zbl 1248.62057
[37] R. A. Rigby and D. M. Stasinopoulos. A semi-parametric additive model for variance heterogeneity. Statistics and Computing, 6(1):57-65, 1996.
[38] R. A. Rigby and D. M. Stasinopoulos. Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society, Series C, 54(3):507-554, 2005. · Zbl 1490.62201
[39] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B, 71(2):319-392, 2009. · Zbl 1248.62156
[40] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47(1):1-52, 1985. · Zbl 0606.62038
[41] B. M. Steele. A modified EM algorithm for estimation in generalized mixed models. Biometrics, 52(4):1295-1310, 1996. · Zbl 0867.62060
[42] L. Tierney, R. E. Kass, and J. B. Kadane. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407):710-716, 1989. · Zbl 0682.62012
[43] M. E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12, pages 652-658, Cambridge, USA, 1999.
[44] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001. · Zbl 0997.68109
[45] M. Tsang, H. Liu, S. Purushotham, P. Murali, and Y. Liu. Neural interaction transparency: disentangling learned interactions for improved interpretability. In Advances in Neural Information Processing Systems 31, pages 5804-5813, USA, 2018.
[46] E. F. Vonesh, H. Wang, L. Nie, and D. Majumdar. Conditional second-order generalized estimating equations for generalized linear and nonlinear mixed-effects models. Journal of the American Statistical Association, 97(457):271-283, 2002. · Zbl 1073.62591
[47] S. N. Wood. Thin plate regression splines. Journal of the Royal Statistical Society, Series B, 65(1):95-114, 2003. · Zbl 1063.62059
[48] S. N. Wood. Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society, Series B, 70(3):495-518, 2008. · Zbl 05563356
[49] S. N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society, Series B, 73(1):3-36, 2011. · Zbl 1411.62089
[50] S. N. Wood, Y. Goude, and S. Shaw. Generalized additive models for large data sets. Journal of the Royal Statistical Society, Series C, 64(1):139-155, 2015.
[51] S. N. Wood, N. Pya, and B. Säfken. Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111(516):1548-1563, 2016.
[52] S. N. Wood, Z. Li, G. Shaddick, and N. H. Augustin. Generalized additive models for gigadata: modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519):1199-1210, 2017.
[53] T.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.