×

Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. (English) Zbl 1012.62088

Summary: We study a multi-armed bandit problem in a setting where covariates are available. We take a nonparametric approach to estimate the functional relationship between the response (reward) and the covariates. The estimated relationships and appropriate randomization are used to select a good arm to play for a greater expected reward. Randomization helps balance the tendency to trust the currently most promising arm with further exploration of other arms. It is shown that, with some familiar nonparametric methods (e.g., histogram), the proposed strategy is strongly consistent in the sense that the accumulated reward is asymptotically equivalent to that based on the best arm (which depends on the covariates) almost surely.

MSC:

62L05 Sequential statistical design
62G08 Nonparametric regression and quantile regression
62C25 Compound decision problems in statistical decision theory
Full Text: DOI

References:

[1] AUER, P., CESA-BIANCHI, N., FREUND, Y. and SCHAPIRE, R. E. (1995). Gambling in a rigged casino: the adversarial multi-armed bandit problem. In 36th Annual Symposium on Foundations of Computer Science 322-331. IEEE Computer Society Press, Los Alamitos, CA. · Zbl 0938.68920
[2] BERRY, D. A., CHEN, R. W., ZAME, A., HEATH, D. C. and SHEPP, L. A. (1997). Bandit problems with infinitely many arms. Ann. Statist. 25 2103-2116. · Zbl 0881.62083 · doi:10.1214/aos/1069362389
[3] BERRY, D. A. and FRISTEDT, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, New York. · Zbl 0659.62086
[4] BIRGÉ, L. and MASSART, P. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli 4 329-375. · Zbl 0954.62033 · doi:10.2307/3318720
[5] CLAYTON, M. K. (1989). Covariate models for Bernoulli bandits. Sequential Anal. 8 405-426. · Zbl 0703.62080 · doi:10.1080/07474948908836190
[6] DEVROYE, L. and GYÖRFI, L. (1985). Distribution-free exponential bounds on the l1 error of partitioning estimates of a regression function. In Proceedings of the Fourth Pannonian Symposium on Mathematical Statistics (F. Konecny, J. Mogyoródi and W. Wertz, eds.) 67-76. Akadémiai Kiadó, Budapest. · Zbl 0607.62040
[7] DEVROYE, L., GYÖRFI, L., KRZY ZAK, A. and LUGOSI, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22 1371-1385. · Zbl 0817.62038 · doi:10.1214/aos/1176325633
[8] DEVROYE, L., GYÖRFI, L. and LUGOSI, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York. · Zbl 0853.68150
[9] FAN, J. and GIJBELS, I. (1996). Local Polynomial Modeling and Its Applications. Chapman and Hall, New York. · Zbl 0873.62037
[10] GITTINS, J. C. (1989). Multi-armed Bandit Allocation Indices. Wiley, New York. · Zbl 0699.90068
[11] GRATCH, J., DEJONG, G. and YANG, Y. (1994). Rational learning: finding a balance between utility and efficiency. Selecting Models from Data: Artificial Intelligence and Statistics. Lecture Notes in Statist. 89 11-20. Springer, New York.
[12] LAI, T. L. and ROBBINS, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6 4-22. · Zbl 0568.62074 · doi:10.1016/0196-8858(85)90002-8
[13] LAI, T. L. and YAKOWITZ, S. (1995). Machine learning and nonparametric bandit theory. IEEE Trans. Automat. Control 40 1199-1209. · Zbl 0883.62090 · doi:10.1109/9.400491
[14] NOBEL, A. (1996). Histogram regression estimation using data-dependent partitions. Ann. Statist. 24 1084-1105. · Zbl 0862.62038 · doi:10.1214/aos/1032526958
[15] POLLARD, D. (1984). Convergence of Stochastic Processes. Springer, New York. · Zbl 0544.60045
[16] ROBBINS, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58 527-535. · Zbl 0049.37009 · doi:10.1090/S0002-9904-1952-09620-8
[17] SARKAR, J. (1991). One-armed bandit problems with covariates. Ann. Statist. 19 1978-2002. · Zbl 0757.62038 · doi:10.1214/aos/1176348382
[18] STONE, C. S. (1977). Consistent nonparametric regression. Ann. Statist. 5 595-620. · Zbl 0366.62051 · doi:10.1214/aos/1176343886
[19] VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York. · Zbl 0862.60002
[20] WOODROOFE, M. (1979). A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74 799-806. · Zbl 0442.62063 · doi:10.2307/2286402
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.