The multi-armed bandit problem with covariates. (English) Zbl 1360.62436

Summary: We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably “localized” static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
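The Successive Elimination step that ABSE runs inside each bin can be sketched as follows. This is a minimal illustration of the generic SE idea (sample all surviving arms in rounds, discard any arm whose empirical mean falls a confidence width below the current leader), not the paper's exact policy; the function name, the Hoeffding-type radius, and the `pull` interface are assumptions for the sketch.

```python
import math
import random

def successive_elimination(pull, n_arms, horizon, delta=0.05):
    """Sketch of Successive Elimination for a static bandit.

    `pull(a)` returns a noisy reward in [0, 1] for arm `a`. Arms are
    sampled in round-robin over the active set; an arm is eliminated
    when its empirical mean trails the best active arm by twice a
    Hoeffding-type confidence radius. Returns the surviving arms.
    """
    active = list(range(n_arms))
    means = [0.0] * n_arms
    counts = [0] * n_arms
    t = 0
    while t < horizon and len(active) > 1:
        for a in list(active):            # one round over active arms
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]  # running mean
            t += 1
            if t >= horizon:
                break
        # confidence radius after n pulls per arm (union bound over
        # arms and rounds; a standard, not paper-specific, choice)
        n = max(counts[active[0]], 1)
        radius = math.sqrt(math.log(4 * n_arms * n * n / delta) / (2 * n))
        best = max(means[a] for a in active)
        active = [a for a in active if means[a] >= best - 2 * radius]
    return active
```

ABSE would apply such a routine locally: covariates falling in a bin feed that bin's elimination procedure, and a bin is split adaptively when the local problem remains ambiguous.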


62L12 Sequential estimation
62L05 Sequential statistical design
62G08 Nonparametric regression and quantile regression
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI arXiv Euclid


[1] Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11 2785-2836. · Zbl 1242.91034
[2] Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608-633. · Zbl 1118.62041
[3] Audibert, J.-Y. and Tsybakov, A. B. (2005). Fast learning rates for plug-in classifiers under the margin condition. Preprint, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris VI and VII. · Zbl 1118.62041
[4] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47 235-256. · Zbl 1012.68093
[5] Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Period. Math. Hungar. 61 55-65. · Zbl 1240.68164
[6] Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. J. R. Stat. Soc. Ser. B Stat. Methodol. 43 265-292. · Zbl 0475.62062
[7] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press, Cambridge. · Zbl 1114.91001
[8] Even-Dar, E., Mannor, S. and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7 1079-1105. · Zbl 1222.68195
[9] Goldenshluger, A. and Zeevi, A. (2009). Woodroofe’s one-armed bandit problem revisited. Ann. Appl. Probab. 19 1603-1633. · Zbl 1168.62071
[10] Goldenshluger, A. and Zeevi, A. (2011). A note on performance limitations in bandit problems with side information. IEEE Trans. Inform. Theory 57 1707-1713. · Zbl 1366.91040
[11] Hazan, E. and Megiddo, N. (2007). Online learning with prior knowledge. In Learning Theory. Lecture Notes in Computer Science 4539 499-513. Springer, Berlin. · Zbl 1203.68152
[12] Juditsky, A., Nazin, A. V., Tsybakov, A. B. and Vayatis, N. (2008). Gap-free bounds for stochastic multi-armed bandit. In Proceedings of the 17th IFAC World Congress.
[13] Kakade, S., Shalev-Shwartz, S. and Tewari, A. (2008). Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008) (A. McCallum and S. Roweis, eds.) 440-447. Omnipress, Helsinki, Finland.
[14] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6 4-22. · Zbl 0568.62074
[15] Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 817-824. MIT Press, Cambridge, MA.
[16] Lu, T., Pál, D. and Pál, M. (2010). Showing relevant ads via Lipschitz context multi-armed bandits. JMLR: Workshop and Conference Proceedings 9 485-492.
[17] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808-1829. · Zbl 0961.62058
[18] Rigollet, P. and Zeevi, A. (2010). Nonparametric bandits with covariates. In COLT (A. Tauman Kalai and M. Mohri, eds.) 54-66. Omnipress, Haifa, Israel.
[19] Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.) 58 527-535. · Zbl 0049.37009
[20] Slivkins, A. (2011). Contextual bandits with similarity information. JMLR: Workshop and Conference Proceedings 19 679-701. · Zbl 1319.62013
[21] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135-166. · Zbl 1105.62353
[22] Vogel, W. (1960). An asymptotic minimax theorem for the two armed bandit problem. Ann. Math. Statist. 31 444-451. · Zbl 0093.15701
[23] Wang, C.-C., Kulkarni, S. R. and Poor, H. V. (2005). Bandit problems with side observations. IEEE Trans. Automat. Control 50 338-355. · Zbl 1105.62353
[24] Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74 799-806. · Zbl 0442.62063
[25] Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Ann. Statist. 30 100-121. · Zbl 1012.62088
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.