## Prediction when fitting simple models to high-dimensional data

*(English)* Zbl 1418.62237

The objective of this paper is to determine when linear methods perform nearly as well as nonlinear ones in high-dimensional prediction problems. Linear subset regression is studied in the context of a high-dimensional linear model, and it is shown that the mean squared prediction error of the best linear predictor is close to that of the corresponding Bayes predictor.

Authors’ abstract: We study linear subset regression in the context of a high-dimensional linear model. Consider \(y=\vartheta +\theta 'z+\epsilon\) with univariate response \(y\) and a \(d\)-vector of random regressors \(z\), and a submodel where \(y\) is regressed on a set of \(p\) explanatory variables that are given by \(x=M'z\), for some \(d\times p\) matrix \(M\). Here, "high-dimensional" means that the number \(d\) of available explanatory variables in the overall model is much larger than the number \(p\) of variables in the submodel. In this paper, we present Pinsker-type results for prediction of \(y\) given \(x\). In particular, we show that the mean squared prediction error of the best linear predictor of \(y\) given \(x\) is close to the mean squared prediction error of the corresponding Bayes predictor \(\mathbb{E}[y\mid x]\), provided only that \(p/\log d\) is small. We also show that the mean squared prediction error of the (feasible) least-squares predictor computed from \(n\) independent observations of \((y,x)\) is close to that of the Bayes predictor, provided only that both \(p/\log d\) and \(p/n\) are small. Our results hold uniformly in the regression parameters and over large collections of distributions for the design variables \(z\).
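To illustrate the second result, the following sketch (a hypothetical simulation, not taken from the paper) fits least squares on \(p\) submodel variables \(x=M'z\) drawn from a \(d\)-dimensional Gaussian design and compares its out-of-sample mean squared prediction error with that of the oracle best linear predictor. With a Gaussian design and identity covariance, the best linear predictor coincides with the Bayes predictor \(\mathbb{E}[y\mid x]\), so the two errors should be close whenever \(p/n\) is small; all dimensions and coefficient choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d overall regressors, p submodel variables,
# n training observations (p/n small, d much larger than p).
d, p, n = 1000, 5, 500
theta = rng.normal(size=d) / np.sqrt(d)  # overall regression coefficients
sigma_eps = 1.0                          # noise standard deviation

# Submodel map M (d x p): here it simply selects the first p coordinates of z.
M = np.zeros((d, p))
M[:p, :p] = np.eye(p)

def draw(n_samples):
    """Draw (x, y) pairs from the overall model y = theta'z + eps."""
    z = rng.normal(size=(n_samples, d))  # Gaussian design, identity covariance
    y = z @ theta + sigma_eps * rng.normal(size=n_samples)
    return z @ M, y                      # observed submodel regressors and response

# Feasible least-squares predictor fit on n observations of (y, x).
x_tr, y_tr = draw(n)
X = np.column_stack([np.ones(n), x_tr])
beta_hat = np.linalg.lstsq(X, y_tr, rcond=None)[0]

# Oracle best linear predictor of y given x has coefficients M'theta here
# (identity design covariance), and equals the Bayes predictor E[y | x].
x_te, y_te = draw(100_000)
pred_ols = np.column_stack([np.ones(len(y_te)), x_te]) @ beta_hat
mse_ols = np.mean((y_te - pred_ols) ** 2)
mse_oracle = np.mean((y_te - x_te @ (M.T @ theta)) ** 2)

print(round(mse_ols, 3), round(mse_oracle, 3))  # the two errors nearly coincide
```

Since \(p/n = 0.01\) here, the feasible least-squares error exceeds the oracle error only by a small margin; increasing \(p\) toward \(n\) widens the gap.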

Reviewer: Denis Sidorov (Irkutsk)

### MSC:

| Code | Classification |
| --- | --- |
| 62H12 | Estimation in multivariate analysis |
| 62F15 | Bayesian inference |
| 62J05 | Linear regression; mixed models |