
What is a statistical model? (With comments and rejoinder).

Summary: This paper addresses two closely related questions, “What is a statistical model?” and “What is a parameter?” The notions that a model must “make sense”, and that a parameter must “have a well-defined meaning” are deeply ingrained in applied statistical work, reasonably well understood at an instinctive level, but absent from most formal theories of modelling and inference. In this paper, these concepts are defined in algebraic terms, using morphisms, functors and natural transformations.
It is argued that inference on the basis of a model is not possible unless the model admits a natural extension that includes the domain for which inference is required. For example, prediction requires that the domain includes all future units, subjects or time points. Although it is usually not made explicit, every sensible statistical model admits such an extension.
Examples are given to show why such an extension is necessary and why a formal theory is required. In the definition of a subparameter, it is shown that certain parameter functions are natural and others are not. Inference is meaningful only for natural parameters. This distinction has important consequences for the construction of prior distributions and also helps to resolve a controversy concerning the Box-Cox model.


spatial effects are often of secondary importance, as in variety trials, and the main intention is to absorb an appropriate level of spatial variation in the formulation, rather than produce a spatial model with scientifically interpretable parameters. Nevertheless, McCullagh's basic point is well taken. For example, I view the use of MRFs in geographical epidemiology [e.g., Besag, York and Mollié (1991)] as mainly of exploratory value, in suggesting additional spatially related covariates whose inclusion would ideally dispense with the need for a spatial formulation;
uniformity trials in Fairfield Smith (1938) and Pearce (1976). Of course, in a genuine variety trial, one might want to predict what the aggregate yield over the entire field would have been for a few individual varieties but this does not require any extension of the formulation to McCullagh's conceptual plots. Indeed, such calculations are especially well suited to the Bayesian paradigm, both theoretically, because one is supposed to deal with potentially observable quantities rather than merely with parameters, and in practice, via MCMC, because the posterior predictive distributions are available rigorously. That is, for the aggregate yield of variety A, one uses the observed yields on plots that were sown with A and generates a set of observations from the likelihood for those that were not for each MCMC sample of parameter values, hence building a corresponding distribution of total yield. One may also construct credible intervals for the difference in total yields between varieties A and B and easily address all manner of questions in ranking and selection that simply cannot be considered in a frequentist framework; for example, the posterior probability that the total yield obtained by sowing any particular variety (perhaps chosen in the light of the experiment) would have been at least 10
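The predictive calculation described above (keep the observed yields on plots sown with A, simulate the remaining plots once per MCMC draw) can be sketched as follows. This is a minimal illustration and not the formulation used in the text: it assumes independent normal plot yields given a draw `(mean, sd)` with no fertility component, and the "posterior draws" are synthetic placeholders standing in for real MCMC output. The names `predictive_totals` and `credible_interval` are hypothetical.

```python
import random

def predictive_totals(observed_a, n_unsown, posterior_draws, seed=0):
    """One predictive total per posterior draw: the observed yields on plots
    sown with A plus simulated yields for the plots that were not."""
    rng = random.Random(seed)
    observed_sum = sum(observed_a)
    totals = []
    for mean, sd in posterior_draws:
        # Assumption: independent N(mean, sd^2) yields on the unsown plots.
        simulated = sum(rng.gauss(mean, sd) for _ in range(n_unsown))
        totals.append(observed_sum + simulated)
    return totals

def credible_interval(samples, level=0.95):
    """Equal-tailed interval read off the sorted predictive samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    lower = ordered[int(round((1 - level) / 2 * last))]
    upper = ordered[int(round((1 + level) / 2 * last))]
    return lower, upper

# Synthetic stand-ins for MCMC output (illustrative only, not real data).
draw_rng = random.Random(1)
draws = [(draw_rng.gauss(4.0, 0.1), abs(draw_rng.gauss(0.5, 0.05)))
         for _ in range(2000)]
observed_a = [3.9, 4.2, 4.1, 3.8]   # hypothetical yields on plots sown with A
totals = predictive_totals(observed_a, n_unsown=16, posterior_draws=draws)
low, high = credible_interval(totals)
```

The same list of totals, computed for two varieties from the same draws, would give the distribution of their difference by subtracting element by element.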
ton (1986). The findings typically suggest that the gains from spatial analysis in a badly designed experiment provide improvements commensurate with standard analysis and optimal design. This is not a reason to adopt poor designs but the simple fact is that, despite the efforts of statisticians, many experiments are carried out using nothing better than randomized complete blocks. It is highly desirable that the representation of fertility is flexible but is also parsimonious because there are many variety effects to be estimated, with very limited replication. McCullagh's use of discrete approximations to harmonic functions in Section 8 fails on both counts: first, local maxima or minima cannot exist except (artificially) at plots on the edge of the trial; second, the degrees of freedom lost in the fit equals the number of such plots and is therefore substantial (in fact, four less in a rectangular layout because the corner plots are ignored throughout the analysis!). Nevertheless, there is something appealing about the averaging property of harmonic functions, if only it were a little more flexible. What is required is a random effects (in frequentist terms) version and that is precisely the thinking behind the use of intrinsic autoregressions in BH and elsewhere. Indeed, such schemes fit McCullagh's discretized harmonic functions perfectly, except for edge effects (because BH embeds the array in a larger one to cater for such effects), and they also provide a good fit to more plausible fertility functions. For specific comments on the Mercer and Hall data, see below. Of course, spatial scale remains an important issue for variety trials and indeed is discussed empirically in Section 2.3 and in the rejoinder of BH. For one-dimensional adjustment, the simplest plausible continuum process is Brownian motion with an arbitrary level, for which the necessary integrations can be
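The two objections to discretized harmonic functions raised above can be checked numerically. The sketch below is an illustration under stated assumptions and not the fitting procedure of Section 8: it takes "discrete harmonic" to mean that every interior plot equals the average of its four neighbours, fills the interior by Gauss-Seidel sweeps from arbitrary edge values, and uses hypothetical helper names.

```python
import random

def non_corner_edge_plots(rows, cols):
    """Edge plots excluding the four corners, which neighbour no interior plot."""
    cells = []
    for i in range(rows):
        for j in range(cols):
            on_edge = i in (0, rows - 1) or j in (0, cols - 1)
            corner = i in (0, rows - 1) and j in (0, cols - 1)
            if on_edge and not corner:
                cells.append((i, j))
    return cells

def harmonic_fill(rows, cols, edge_values, sweeps=500):
    """Interior values determined by the edge values through the four-neighbour
    averaging property, found by repeated Gauss-Seidel sweeps."""
    grid = [[0.0] * cols for _ in range(rows)]
    for (i, j), value in edge_values.items():
        grid[i][j] = value
    for _ in range(sweeps):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
    return grid

rows, cols = 6, 8
rng = random.Random(0)
edge = {cell: rng.uniform(1.0, 2.0) for cell in non_corner_edge_plots(rows, cols)}
grid = harmonic_fill(rows, cols, edge)

interior = [grid[i][j] for i in range(1, rows - 1) for j in range(1, cols - 1)]
free_parameters = len(edge)          # one per non-corner edge plot
all_edge_plots = 2 * (rows + cols) - 4
```

For a 6 by 8 layout the fit is driven by `free_parameters` edge values, four fewer than `all_edge_plots`, and every interior value lies between the smallest and largest edge value, so no interior extremum can occur.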
SEATTLE, WASHINGTON 98195-4322 E-MAIL: julian@stat.washington.edu

recently by Chen, Lockhart and Stephens (2002). One reason for its attractiveness to me is that if one considers the more realistic semiparametric model,

$$a(Y) = \beta X + \varepsilon, \tag{6}$$

where $a$ is an arbitrary monotone transformation and $\varepsilon$ has a $N(\mu, \sigma^2)$ distribution, then $\beta/\sigma$ is identifiable and estimable at the $n^{-1/2}$ rate while $\beta$ is not identifiable. Bickel and Ritov (1997) discuss ways of estimating $\beta/\sigma$ and $a$, which is also estimable at rate $n^{-1/2}$, optimally and suggest approaches to algorithms in their paper. The choice $(\beta, \lambda)$ is of interest to me because its consideration is the appropriate response to the Hinkley-Runger critique. One needs to specify a joint confidence region for $(\beta, \lambda)$, making statements such as "the effect magnitude $\beta$ on the $\lambda$ scale is consistent with the data." The effect of lack of knowledge of $\lambda$ on the variance of $\hat{\beta}$ remains interpretable. It would be more attractive if McCullagh could somehow divorce the calculus of this paper from the language of functors, morphisms and canonical diagrams for more analysis-oriented statisticians such as myself.
BERKELEY, CALIFORNIA 94720-3860 E-MAIL: bickel@stat.berkeley.edu
TORONTO, ONTARIO M5S 3G3 CANADA E-MAIL: reid@utstat.utoronto.ca

from Helland (2002). Let a group $G$ be defined on the parameter space $\Theta$ of a model. A measurable function $\xi$ from $\Theta$ to another space $\Xi$ is called a natural subparameter if $\xi(\theta_1) = \xi(\theta_2)$ implies $\xi(g\theta_1) = \xi(g\theta_2)$ for all $g \in G$. For example, in the location and scale case the location parameter $\mu$ and the scale parameter $\sigma$ are natural, while the coefficient of variation $\mu/\sigma$ is not natural (it is if the group is changed to the pure scale group). In general the parameter $\xi$ is natural iff the level sets of the function $\xi = \xi(\theta)$ are transformed onto other

inconsistency discussed in detail by Dawid, Stone and Zidek (1973). Their main problem is a violation of the plausible reduction principle: assume that a general method of inference, applied to data $(y, z)$, leads to an answer that in fact depends on $z$ alone. Then the same answer should appear if the same method is applied to $z$ alone. A Bayesian implementation of this principle runs as follows: assume first that the probability density $p(y, z \mid \eta, \zeta)$ depends on the parameter $\theta = (\eta, \zeta)$ in such a way that the marginal density $p(z \mid \zeta)$ only depends upon $\zeta$. Then the following implication should hold: if (a) the marginal posterior density $\pi(\zeta \mid y, z)$ depends on the data $(y, z)$ only through $z$, then (b) this $\pi(\zeta \mid z)$ should be proportional to $a(\zeta)\,p(z \mid \zeta)$ for some function $a(\zeta)$, so that it is proportional to a posterior based solely on the $z$ data. For a proper prior $\pi(\eta, \zeta)$ this can be shown to hold with $a(\zeta)$ being the appropriate marginal prior $\pi(\zeta)$. Dawid, Stone and Zidek (1973) gave several examples where the implication above is violated by improper priors of the kind that we sometimes expect to have in objective Bayes inference. For our purpose, the interesting case is when there is a transformation group $G$ defined on the parameter space.
Under the assumption that $\zeta$ is maximal invariant under $G$ and making some regularity conditions, it is then first shown by Dawid, Stone and Zidek (1973) that it necessarily follows that $p(z \mid \eta, \zeta)$ only depends upon $\zeta$, next (a) is shown to hold always, and finally (b) holds if and only if the prior is of the form $\nu_G(d\eta)\,d\zeta$, where $\nu_G$ is right Haar measure, and the measure
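The definition of a natural subparameter quoted from Helland (2002) earlier in this comment can be probed numerically. The sketch below is an illustration only: it assumes the location-scale group acts on (mu, sigma) by (mu, sigma) -> (a + b*mu, b*sigma) with b > 0, checks the defining implication on a small hand-picked set of parameter pairs and group elements, and uses hypothetical function names. A finite check of this kind can refute naturalness but can only support it, not prove it.

```python
def location_scale(a, b):
    """Group element acting as (mu, sigma) -> (a + b*mu, b*sigma), b > 0."""
    return lambda theta: (a + b * theta[0], b * theta[1])

def pure_scale(b):
    """Group element acting as (mu, sigma) -> (b*mu, b*sigma), b > 0."""
    return lambda theta: (b * theta[0], b * theta[1])

def is_natural_on(xi, group_elements, pairs, tol=1e-9):
    """Check: whenever xi(t1) == xi(t2), also xi(g t1) == xi(g t2) for each g."""
    for t1, t2 in pairs:
        if abs(xi(t1) - xi(t2)) > tol:
            continue  # the implication is vacuous for this pair
        for g in group_elements:
            if abs(xi(g(t1)) - xi(g(t2))) > tol:
                return False
    return True

def mu(theta):
    return theta[0]

def sigma(theta):
    return theta[1]

def ratio(theta):
    return theta[0] / theta[1]  # mu / sigma

# Pairs chosen to share a value of mu, of sigma, and of mu/sigma respectively.
pairs = [((1.0, 2.0), (1.0, 5.0)),
         ((1.0, 2.0), (3.0, 2.0)),
         ((1.0, 2.0), (2.0, 4.0))]
ls_group = [location_scale(a, b) for a in (-1.0, 0.0, 2.0) for b in (0.5, 1.0, 3.0)]
scale_group = [pure_scale(b) for b in (0.5, 1.0, 3.0)]

mu_natural = is_natural_on(mu, ls_group, pairs)
sigma_natural = is_natural_on(sigma, ls_group, pairs)
ratio_natural_ls = is_natural_on(ratio, ls_group, pairs)
ratio_natural_scale = is_natural_on(ratio, scale_group, pairs)
```

On these inputs the check passes for mu and sigma under the location-scale group, fails for mu/sigma under that group because a location shift separates (1, 2) from (2, 4), and passes for mu/sigma under the pure scale group.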
[90] HELLAND, I. S. (2001). Reduction of regression models under symmetry. In Algebraic Methods in Statistics and Probability (M. Viana and D. Richards, eds.) 139-153. Amer. Math. Soc., Providence, RI. · Zbl 1012.62077
[91] HELLAND, I. S. (2002). Statistical inference under a fixed symmetry group. Available at http:// www.math.uio.no/ ingeh/.
GUILFORD, CONNECTICUT 06437 E-MAIL: stevepincus@alum.mit.edu

in McCullagh (1980). Suppose we are dealing with a universe where the natural models for handling of binary responses are the logistic regression models. This could be some socioeconomic research area where people's attitudes to various features of brands or service levels are recorded on a binary scale, and the interest lies in the dependence of these attitudes on all sorts of background variables. How do we extend this universe to deal with ordered categorical responses, for example, on three-point positive/indifferent/negative scales? A natural requirement seems to be that if data are dichotomized by the (arbitrary) selection of a cutpoint (putting, for example, negative and indifferent together in a single category), then the marginal model coming out of this is a logistic regression model. This is, after all, just a way of recording a binary response, and even though it would hurt any statistician to throw away information in this way, it is done all the time on more invisible levels. Another natural requirement is that the parameters of interest, with the constant term as an obvious exception, should not depend on how the cutpoint is selected. It is easy to show that these two requirements are met by one and only one class of models for ordered responses, namely the models that can

and Nelder (1989). Thus, we have here the absurd situation that the potentially canonical (but unfortunately nonexistent) answer to a simple and canonical question results in a collection of very useful methods. The overdispersion models exist as perfectly respectable operational objects, but not as mathematical objects. My personal opinion [Tjur (1998)] is that the simplest way of giving these models a concrete interpretation goes via approximation by nonlinear models for normal data and a small adjustment of the usual estimation method for these models.
But neither this, nor the concept of quasi-likelihood, answers the fundamental question whether there is a way of modifying the conditions (1) and (2) above in such a way that a meaningful theory of generalized linear models with overdispersion comes out as the unique answer. It is tempting to ask, in the present context, whether it is a necessity at all that these models "exist" in the usual sense. Is it so, perhaps, that after a century or two people will find this question irrelevant, just as we find old discussions about existence of the number + irrelevant? If this is the case, a new attitude to statistical models is certainly required.
has recently been obtained by Wichura (2001). Fraser and Reid ask whether category theory can do more than provide a framework. My experience here is similar to Huber's, namely that category theory is well suited for this purpose but, as a branch of logic, that is all we can expect from it. Regarding the coefficient of variation, I agree that there are applications in which this is a useful and natural parameter or statistic, just as there are (a few) applications in which the correlation coefficient is useful. The groups used in this paper are such that the origin is either fixed or completely arbitrary. In either case there is no room for hedging. In practice, things are rarely so clear cut. In order to justify the coefficient of variation, it seems to me that the applications must be such that the scale of measurement has a reasonably well-defined origin relevant to the problem. The Cauchy model with the real fractional linear group was originally used as an example to highlight certain inferential problems. I do not believe I have encountered an application in which it would be easy to make a convincing case for the relevance of this group. Nevertheless, I think it is helpful to study such examples for the light they may shed on foundational matters. The fact that the median is not a natural subparameter is an insight that casts serious doubt on the relevance of the group in "conventional" applications. To turn the argument around, the fact that the Cauchy model is closed under real fractional linear transformation is not, in itself, an adequate reason to choose that group as the base category. In that sense, I agree with a primary thesis of Fraser's Structure of Inference that the group supersedes the probability model. Tjur's remarks capture the spirit of what I am attempting to do. In the cumulative logit model, it is clear intuitively what is meant by the statement that the parameter of interest should not depend on how the cutpoints are selected. 
As is often the case, what is intuitively clear is not so easy to express in mathematical terms. It does not mean that the maximum-likelihood estimate is unaffected by this choice. For that reason, although Tjur's second condition on overdispersion models has a certain appeal, I do not think it carries the same force as the first. His description of natural subparameters in regression is a model of clarity.
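The cutpoint-invariance property under discussion can be made concrete with a small calculation. The sketch below is an illustration under assumed conventions: it writes the cumulative logit model as logit P(Y <= k | x) = theta_k - beta*x (the sign convention and the numerical values of `cutpoints` and `beta` are assumptions chosen for the example), merges categories at each possible cutpoint, and compares the resulting logistic slopes.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def category_probs(x, cutpoints, beta):
    """P(Y = k | x) under logit P(Y <= k | x) = theta_k - beta * x."""
    cumulative = [expit(theta - beta * x) for theta in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

def dichotomized_log_odds(x, cutpoints, beta, k):
    """Log-odds of the merged event {Y <= k}, built from category probabilities."""
    probs = category_probs(x, cutpoints, beta)
    low = sum(probs[:k + 1])
    return math.log(low / (1.0 - low))

cutpoints = [-0.5, 1.0]   # theta_0 < theta_1: three ordered categories
beta = 0.8
x = 0.3

# Slope of the dichotomized logistic model at each choice of cutpoint.
slopes = [dichotomized_log_odds(x + 1.0, cutpoints, beta, k)
          - dichotomized_log_odds(x, cutpoints, beta, k)
          for k in range(len(cutpoints))]
# Intercepts recovered at x = 0 for each choice of cutpoint.
intercepts = [dichotomized_log_odds(0.0, cutpoints, beta, k)
              for k in range(len(cutpoints))]
```

Each dichotomization is a logistic regression with the same slope, equal to minus `beta` in this convention, while the intercept changes with the cutpoint. This concerns the parameter itself; as the text notes, it does not imply that the maximum-likelihood estimate is unaffected by the choice of cutpoint.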
given the values on the contour (Matheron, 1971). Both processes are also conformal, but the similarity ends there. The set of conformal processes is also closed under addition of independent processes. Thus, the sum of white noise and W is conformal but not Markov. Beyond convolutions of white noise and W, it appears most unlikely that there exists another conformal process with Gaussian increments. Whittle's (1954) family of stationary Gaussian processes has the Markov property [Chilès and Delfiner (1999)] but the family is not closed under conformal maps nor under convolution.
CHICAGO, ILLINOIS 60637-1514 E-MAIL: pmcc@galton.uchicago.edu
