
Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. (The 1994 Neyman Memorial Lecture). (English) Zbl 0854.62042
Summary: Let $$y_i$$, $$i = 1, \dots, n$$, be independent observations with the density of $$y_i$$ of the form $$h(y_i, f_i) = \exp [y_if_i - b(f_i) + c(y_i)]$$, where $$b$$ and $$c$$ are given functions and $$b$$ is twice continuously differentiable and bounded away from 0. Let $$f_i = f(t(i))$$, where $$t = (t_1, \dots, t_d) \in {\mathcal T}^{(1)} \otimes \cdots \otimes {\mathcal T}^{(d)} = {\mathcal T}$$, the $${\mathcal T}^{(\alpha)}$$ are measurable spaces of rather general form and $$f$$ is an unknown function on $${\mathcal T}$$ with some assumed “smoothness” properties. Given $$\{y_i, t(i), i = 1, \dots, n\}$$, it is desired to estimate $$f(t)$$ for $$t$$ in some region of interest contained in $${\mathcal T}$$.
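As an illustration of this density (a standard worked case, not part of the original summary), consider Bernoulli observations $$y_i \in \{0, 1\}$$ with $$p_i = P(y_i = 1)$$, the kind of data used in the application described in the summary. Taking $$f_i$$ to be the logit, with $$b(f) = \log (1 + e^f)$$ and $$c(y) = 0$$, gives $f_i = \log \frac{p_i}{1 - p_i}, \qquad \exp [y_i f_i - \log (1 + e^{f_i})] = p_i^{y_i} (1 - p_i)^{1 - y_i},$ so that $$b'(f_i) = p_i$$ and $$b''(f_i) = p_i (1 - p_i)$$.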
We develop the fitting of smoothing spline ANOVA models to these data, where the models take the form $f(t) = C + \sum_\alpha f_\alpha (t_\alpha) + \sum_{\alpha < \beta} f_{\alpha \beta} (t_\alpha, t_\beta) + \cdots.$ The components of the decomposition satisfy side conditions that generalize the usual side conditions for parametric ANOVA. The estimate of $$f$$ is obtained as the minimizer, in an appropriate function space, of ${\mathcal L} (y,f) + \sum_\alpha \lambda_\alpha J_\alpha (f_\alpha) + \sum_{\alpha < \beta} \lambda_{\alpha \beta} J_{\alpha \beta} (f_{\alpha \beta}) + \cdots,$ where $${\mathcal L} (y,f)$$ is the negative log likelihood of $$y = (y_1, \dots, y_n)'$$ given $$f$$, the $$J_\alpha$$, $$J_{\alpha \beta}, \dots$$ are quadratic penalty functionals and the ANOVA decomposition is terminated in some manner. There are five major parts required to turn this program into a practical data analysis tool:
(1) methods for deciding which terms in the ANOVA decomposition to include (model selection), (2) methods for choosing good values of the smoothing parameters $$\lambda_\alpha$$, $$\lambda_{\alpha \beta}, \dots$$, (3) methods for making confidence statements concerning the estimate, (4) numerical algorithms for the calculations and, finally, (5) public software.
In this paper we carry out this program, relying on earlier work and filling in important gaps. The overall scheme is applied to Bernoulli data from the Wisconsin Epidemiologic Study of Diabetic Retinopathy to model the risk of progression of diabetic retinopathy as a function of glycosylated hemoglobin, duration of diabetes and body mass index. It is believed that the results have wide practical application to the analysis of data from large epidemiologic studies.
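The penalized likelihood idea in the summary can be sketched in a much reduced setting. The Python example below is an illustrative assumption and not the paper's method or software: it handles a single covariate on an equally spaced grid, replaces the reproducing-kernel penalty functionals by a discrete second-difference penalty, fixes the smoothing parameter `lam` by hand instead of choosing it from the data, and uses simulated Bernoulli data. The names `objective` and `fit_penalized_logit` are hypothetical.

```python
import numpy as np


def objective(f, y, P):
    """Negative Bernoulli log likelihood plus a quadratic roughness penalty."""
    # log(1 + exp(f)) computed stably; this is b(f) for Bernoulli data.
    return np.sum(np.logaddexp(0.0, f) - y * f) + 0.5 * f @ P @ f


def fit_penalized_logit(y, lam, n_iter=50, tol=1e-8):
    """Newton iteration for the logit f at n equally spaced design points.

    Assumption: a second-difference penalty stands in for a spline penalty.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)   # (n - 2) x n second-difference matrix
    P = lam * (D.T @ D)                   # quadratic penalty matrix
    f = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-f))      # b'(f): fitted probabilities
        grad = (p - y) + P @ f            # gradient of the penalized objective
        hess = np.diag(p * (1.0 - p)) + P # b''(f) on the diagonal, plus penalty
        step = np.linalg.solve(hess, grad)
        f = f - step
        if np.max(np.abs(step)) < tol:
            break
    return f, P


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 1.0, n)
f_true = 1.5 * np.sin(2.0 * np.pi * t)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-f_true))).astype(float)

lam = 10.0
f_hat, P = fit_penalized_logit(y, lam)
p_hat = 1.0 / (1.0 + np.exp(-f_hat))      # estimated risk at each design point
```

Because the penalized objective is convex, the Newton iteration settles at its unique minimizer here. Larger values of `lam` pull the fitted logit toward a straight line in `t`, since linear functions are not penalized by second differences.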

MSC:
62G07 Density estimation
41A15 Spline approximation
62P10 Applications of statistics to biology and medical sciences; meta analysis
65D10 Numerical smoothing, curve fitting
41A63 Multidimensional problems
Software:
GRKPACK; SAS; bootstrap; GCVPACK