zbMATH — the first resource for mathematics

Optimal online learning for nonlinear belief models using discrete priors. (English) Zbl 1457.90167
Summary: We consider an optimal learning problem where we are trying to learn a function that is nonlinear in unknown parameters in an online setting. We formulate the problem as a dynamic program, provide the optimality condition using Bellman’s equation, and propose a multiperiod lookahead policy to overcome the nonconcavity in the value of information. We adopt a sampled belief model, which we refer to as a discrete prior. For an infinite-horizon problem with discounted cumulative rewards, we prove asymptotic convergence properties under the proposed policy, a rare result for online learning. We then demonstrate the approach in three different settings: a health setting where we make medical decisions to maximize healthcare response over time, a dynamic pricing setting where we make pricing decisions to maximize the cumulative revenue, and a clinical pharmacology setting where we make dosage controls to minimize the deviation between actual and target effects.
The e-companion is available at https://doi.org/10.1287/opre.2019.1921.
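To make the sampled belief model concrete, the sketch below shows one Bayesian update of a discrete prior over candidate parameter vectors for a nonlinear response curve. It is only an illustration of the general idea, not the authors' implementation: the logistic response function, the Gaussian observation noise, and all parameter values are assumptions introduced here.

import numpy as np

# Sketch of a discrete-prior (sampled belief) update for a nonlinear model.
# Assumptions made for illustration only: a logistic response curve and
# Gaussian observation noise with known standard deviation sigma_w.

def response(x, theta):
    """Hypothetical nonlinear model f(x; theta) with theta = (a, b)."""
    a, b = theta
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

def update_prior(p, thetas, x, y, sigma_w=0.1):
    """Return posterior weights proportional to p_k * L(y | x, theta_k)."""
    preds = np.array([response(x, th) for th in thetas])
    likelihood = np.exp(-0.5 * ((y - preds) / sigma_w) ** 2)
    posterior = p * likelihood
    return posterior / posterior.sum()

# Uniform prior over K = 3 sampled parameter vectors.
thetas = [(-1.0, 0.5), (0.0, 1.0), (1.0, 2.0)]
p = np.full(len(thetas), 1.0 / len(thetas))

# One observation (x, y) sharpens the belief about which theta is correct.
p = update_prior(p, thetas, x=0.8, y=0.7)
print(p)

In an online setting these posterior weights would feed a decision rule such as the multiperiod lookahead policy described in the summary, which trades off immediate reward against the value of the information the next observation provides.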
MSC:
90C39 Dynamic programming
90C90 Applications of mathematical programming
Software:
EGO; POMDPs