Approximation methods for supervised learning.

*(English)* Zbl 1146.62322
Summary: Let $\rho$ be an unknown Borel measure defined on the space $Z := X \times Y$ with $X \subset \mathbb{R}^{d}$ and $Y = [-M,M]$. Given a set $\mathbf{z}$ of $m$ samples $z_{i} = (x_{i},y_{i})$ drawn according to $\rho$, the problem of estimating a regression function $f_{\rho}$ using these samples is considered. The main focus is to understand what rate of approximation, measured either in expectation or in probability, can be obtained under a given prior $f_{\rho} \in \Theta$, i.e., under the assumption that $f_{\rho}$ is in the set $\Theta$, and which algorithms attain optimal or semioptimal (up to logarithms) results. The optimal rate of decay in terms of $m$ is established for many priors given either in terms of smoothness of $f_{\rho}$ or its rate of approximation measured in one of several ways.
This optimal rate is determined by two types of results. Upper bounds are established using various tools from approximation theory such as entropy, widths, and linear and nonlinear approximation. Lower bounds are proved using Kullback-Leibler information together with Fano inequalities and a certain type of entropy. A distinction is drawn between algorithms that employ knowledge of the prior in the construction of the estimator and those that do not. Algorithms of the second type that are universally optimal for a certain range of priors are given.
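
For orientation, the quantities named in the summary can be written out as follows. This is a sketch under the usual least-squares assumptions, which the summary does not state explicitly: the error is taken in $L_2(\rho_X)$, where $\rho_X$ denotes the marginal of $\rho$ on $X$, $f_{\mathbf{z}}$ denotes an estimator built from the samples, and $\eta > 0$ is a tolerance. The symbols $\rho_X$, $f_{\mathbf{z}}$, $\mathcal{E}$, and $\eta$ are notation assumed here for illustration.

```latex
% Regression function: conditional mean of y given x (assumed standard definition)
\[
  f_{\rho}(x) := \int_{Y} y \, d\rho(y \mid x), \qquad x \in X .
\]
% Least-squares risk and its excess over the regression function
\[
  \mathcal{E}(f) := \int_{Z} \bigl(f(x) - y\bigr)^{2} \, d\rho ,
  \qquad
  \mathcal{E}(f) - \mathcal{E}(f_{\rho})
    = \| f - f_{\rho} \|_{L_2(\rho_X)}^{2} .
\]
% Accuracy of an estimator f_z, measured in expectation or in probability
\[
  \mathbb{E}_{\rho^{m}} \, \| f_{\mathbf{z}} - f_{\rho} \|_{L_2(\rho_X)}
  \qquad \text{or} \qquad
  \rho^{m} \bigl\{ \mathbf{z} : \| f_{\mathbf{z}} - f_{\rho} \|_{L_2(\rho_X)} > \eta \bigr\} .
\]
```

Under these assumptions, an upper bound controls one of the last two quantities uniformly over $f_{\rho} \in \Theta$ for a specific estimator, while a lower bound shows that no estimator can do better over the same class.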

##### MSC:

| Code | Description |
| --- | --- |
| 62G08 | Nonparametric regression |
| 68T05 | Learning and adaptive systems |
| 65C60 | Computational problems in statistics |
| 41A30 | Approximation by other special function classes |