×

Optimal learning with non-Gaussian rewards. (English) Zbl 1345.60039

Summary: We propose a novel theoretical characterization of the optimal “Gittins index” policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process which probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach enables an easy connection to the convenient time-change properties of a Brownian motion. Although no such device is available in general for the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We provide the free-boundary PIDE in explicit form under the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value of a given belief state.

MSC:

60G40 Stopping times; optimal stopping problems; gambling theory
60J75 Jump processes (MSC2010)
60G51 Processes with independent increments; Lévy processes
35R09 Integro-partial differential equations
45K05 Integro-partial differential equations
65R99 Numerical methods for integral equations, integral transforms
PDFBibTeX XMLCite
Full Text: DOI Euclid

References:

[1] Aalto, S., Ayesta, U. and Righter, R. (2011). Properties of the Gittins index with application to optimal scheduling. Prob. Eng. Inf. Sci. 25 , 269-288. · Zbl 1233.90104 · doi:10.1017/S0269964811000015
[2] Agarwal, D., Chen, B.-C. and Elango, P. (2009). Explore/exploit schemes for web content optimization. In Proceedings of the 9th IEEE International Conference on Data Mining , IEEE, New York, pp. 1-10.
[3] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 , 235-256. · Zbl 1012.68093 · doi:10.1023/A:1013689704352
[4] Berry, D. A. and Pearson, L. M. (1985). Optimal designs for clinical trials with dichotomous responses. Statist. Medicine 4 , 497-508.
[5] Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. J. Econom. Dynamics Control 27 , 87-108. · Zbl 1168.91324 · doi:10.1016/S0165-1889(01)00028-8
[6] Buonaguidi, B. and Muliere, P. (2013). Sequential testing problems for Lévy processes. Sequential Anal. 32 , 47-70. · Zbl 1261.62075 · doi:10.1080/07474946.2013.752169
[7] Caro, F. and Gallien, J. (2007). Dynamic assortment with demand learning for seasonal consumer goods. Manag. Sci. 53 , 276-292. · Zbl 1232.91420 · doi:10.1287/mnsc.1060.0613
[8] Chhabra, M. and Das, S. (2011). Learning the demand curve in posted-price digital goods auctions. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems , pp. 63-70.
[9] Chick, S. E. (2006). Subjective probability and Bayesian methodology. In Handbooks in Operations Research and Management Science , Vol. 13, Simulation , North-Holland, Amsterdam, pp. 225-258.
[10] Chick, S. E. and Frazier, P. I. (2012). Sequential sampling with economics of selection procedures. Manag. Sci. 58 , 550-569.
[11] Chick, S. E. and Gans, N. (2009). Economic analysis of simulation selection problems. Manag. Sci. 55 , 421-437.
[12] Chick, S. E. and Inoue, K. (2001). New procedures to select the best simulated system using common random numbers. Manag. Sci. 47 , 1133-1149.
[13] Çinlar, E. (2003). Conditional Lévy processes. Comput. Math. Appl. 46 , 993-997. · Zbl 1045.60043 · doi:10.1016/S0898-1221(03)90113-1
[14] Çinlar, E. (2011). Probability and Stochastics . Springer, New York. · Zbl 1226.60001 · doi:10.1007/978-0-387-87859-1
[15] Cohen, A. and Solan, E. (2013). Bandit problems with Lévy processes. Math. Operat. Res. 38 , 92-107. · Zbl 1304.60048 · doi:10.1287/moor.1120.0564
[16] Coquet, F. and Toldo, S. (2007). Convergence of values in optimal stopping and convergence of optimal stopping times. Electron. J. Prob. 12 , 207-228. · Zbl 1144.62067 · doi:10.1214/EJP.v12-288
[17] DeGroot, M. H. (1970). Optimal Statistical Decisions . McGraw-Hill, New York. · Zbl 0225.62006
[18] Dynkin, E. B. (1965). Markov Processes . Academic Press, New York. · Zbl 0132.37901
[19] El Karoui, N. and Karatzas, I. (1994). Dynamic allocation problems in continuous time. Ann. Appl. Prob. 4 , 255-286. · Zbl 0831.93069 · doi:10.1214/aoap/1177005062
[20] Farias, V. F. and Van Roy, B. (2010). Dynamic pricing with a prior on market response. Operat. Res. 58 , 16-29. · Zbl 1233.91104 · doi:10.1287/opre.1090.0729
[21] Filliger, R. and Hongler, M.-O. (2007). Explicit Gittins indices for a class of superdiffusive processes. J. Appl. Prob. 44 , 554-559. · Zbl 1133.60319 · doi:10.1239/jap/1183667421
[22] Frazier, P. I. and Powell, W. B. (2011). Consistency of sequential Bayesian sampling policies. SIAM J. Control Optimization 49 , 712-731. · Zbl 1284.62170 · doi:10.1137/090775026
[23] Frazier, P. I., Powell, W. B. and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM J. Control Optimization 47 , 2410-2439. · Zbl 1274.62155 · doi:10.1137/070693424
[24] Gittins, J. C. and Jones, D. M. (1979). A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika 66 , 561-565.
[25] Gittins, J. C. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. Ann. Statist. 20 , 1625-1636. · Zbl 0760.62080 · doi:10.1214/aos/1176348788
[26] Gittins, J. C., Glazebrook, K. D. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices , 2nd edn. John Wiley, Oxford. · Zbl 1401.90257
[27] Glazebrook, K. D. and Minty, R. (2009). A generalized Gittins index for a class of multiarmed bandits with general resource requirements. Math. Operat. Res. 34 , 26-44. · Zbl 1218.90208 · doi:10.1287/moor.1080.0342
[28] Glazebrook, K. D., Meissner, J. and Schurr, J. (2013). How big should my store be? On the interplay between shelf-space, demand learning and assortment decisions. Working paper, Lancaster University.
[29] Itô, K., Barndorff-Nielsen, O. E. and Sato, K.-I. (2004). Stochastic Processes: Lectures Given at Aarhus University . Springer, Berlin.
[30] Jouini, W. and Moy, C. (2012). Channel selection with Rayleigh fading: a multi-armed bandit framework. In Proceedings of the 13th IEEE International Workshop on Signal Processing Advances in Wireless Communications , IEEE, New York, pp. 299-303.
[31] Kaspi, H. and Mandelbaum, A. (1995). Lévy bandits: multi-armed bandits driven by Lévy processes. Ann. Appl. Prob. 5 , 541-565. · Zbl 0830.60065 · doi:10.1214/aoap/1177004777
[32] Katehakis, M. N. and Veinott, A. F., Jr. (1987). The multi-armed bandit problem: decomposition and computation. Math. Operat. Res. 12 , 262-268. · Zbl 0618.90097 · doi:10.1287/moor.12.2.262
[33] Kyprianou, A. E. (2006). Introductory Lectures on Fluctuations of Lévy Processes with Applications . Springer, Berlin. · Zbl 1104.60001 · doi:10.1007/978-3-540-31343-4
[34] Lamberton, D. and Pagès, G. (1990). Sur l’approximation des réduites. Ann. Inst. H. Poincaré Prob. Statist. 26 , 331-355. · Zbl 0704.60042
[35] Lariviere, M. A. and Porteus, E. L. (1999). Stalking information: Bayesian inventory management with unobserved lost sales. Manag. Sci. 45 , 346-363. · Zbl 1231.90043 · doi:10.1287/mnsc.45.3.346
[36] Mandelbaum, A. (1986). Discrete multiarmed bandits and multiparameter processes. Prob. Theory Relat. Fields 71 , 129-147. · Zbl 0788.60056 · doi:10.1007/BF00366276
[37] Mandelbaum, A. (1987). Continuous multi-armed bandits and multiparameter processes. Ann. Prob. 15 , 1527-1556. · Zbl 0657.62098 · doi:10.1214/aop/1176991992
[38] Monroe, I. (1978). Processes that can be embedded in Brownian motion. Ann. Prob. 6 , 42-56. · Zbl 0392.60057 · doi:10.1214/aop/1176995609
[39] Müller, A. (1997). How does the value function of a Markov decision process depend on the transition probabilities? Math. Operat. Res. 22 , 872-885. · Zbl 0892.90178 · doi:10.1287/moor.22.4.872
[40] Müller, A. and Stoyan, D. (2002). Comparison Methods for Stochastic Models and Risks . John Wiley, Chichester. · Zbl 0999.60002
[41] Peskir, G. and Shiryaev, A. N. (2006). Optimal Stopping and Free-Boundary Problems . Birkhäuser, Basel. · Zbl 1115.60001
[42] Powell, W. B. and Ryzhov, I. O. (2012). Optimal Learning . John Wiley, Hoboken, NJ. · Zbl 1331.90042
[43] Ryzhov, I. O. and Powell, W. B. (2011). The value of information in multi-armed bandits with exponentially distributed rewards. In Proceedings of the 2011 International Conference on Computational Science , pp. 1363-1372.
[44] Ryzhov, I. O., Powell, W. B. and Frazier, P. I. (2012). The knowledge gradient algorithm for a general class of online learning problems. Operat. Res. 60 , 180-195. · Zbl 1241.90201 · doi:10.1287/opre.1110.0999
[45] Sato, K.-I. (1999). Lévy Processes and Infinitely Divisible Distributions . Cambridge University Press. · Zbl 0973.60001
[46] Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders . Springer, New York. · Zbl 0806.62009
[47] Steele, J. M. (2001). Stochastic Calculus and Financial Applications . Springer, New York. · Zbl 0962.60001
[48] Van Moerbeke, P. (1976). On optimal stopping and free boundary problems. Arch. Rational Mech. Anal. 60 , 101-148. · Zbl 0336.35047 · doi:10.1007/BF00250676
[49] Vazquez, E. and Bect, J. (2010). Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Statist. Planning Infer. 140 , 3088-3095. · Zbl 1419.62200 · doi:10.1016/j.jspi.2010.04.018
[50] Wang, X. and Wang, Y. (2010). Optimal investment and consumption with stochastic dividends. Appl. Stoch. Models Business Industry 26 , 792-808. · Zbl 1226.91071 · doi:10.1002/asmb.823
[51] Yao, Y.-C. (2006). Some results on the Gittins index for a normal reward process. In Time Series and Related Topics , Institute of Mathematical Statistics, Beachwood, OH, pp. 284-294. · Zbl 1268.60053 · doi:10.1214/074921706000001111
[52] Yu, Y. (2011). Structural properties of Bayesian bandits with exponential family distributions. Preprint. Available at http://arxiv.org/abs/1103.3089.
[53] Zhang, Q., Seetharaman, P. B. and Narasimhan, C. (2012). The indirect impact of price deals on households’ purchase decisions through the formation of expected future prices. J. Retailing 88 , 88-101. \endharvreferences
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.