
Bayesian incentive-compatible bandit exploration. (English) Zbl 1451.90079

Summary: As self-interested individuals (“agents”) make decisions over time, they utilize information revealed by other agents in the past and produce information that may help agents in the future. This phenomenon is common in a wide range of scenarios in the Internet economy, as well as in medical decision making. Each agent would like to exploit (select the best action given the current information) but would prefer previous agents to explore (try out various alternatives to collect information). A social planner, by means of a carefully designed recommendation policy, can incentivize the agents to balance exploration and exploitation so as to maximize social welfare. We model the planner’s recommendation policy as a multi-armed bandit algorithm under the incentive-compatibility constraints induced by the agents’ Bayesian priors. We design a bandit algorithm that is incentive-compatible and has asymptotically optimal performance, as measured by regret. Further, we provide a black-box reduction from an arbitrary multi-armed bandit algorithm to an incentive-compatible one, with only a constant multiplicative increase in regret. This reduction works for very general bandit settings that incorporate contexts and arbitrary partial feedback.
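
The black-box reduction can be illustrated with a minimal sketch of the phase-based "hidden exploration" idea: time is divided into phases of length L, a single randomly positioned round per phase follows the wrapped bandit algorithm, and every other round recommends the current posterior-best arm. The class name, phase structure, and bandit interface (next_arm/update) below are illustrative assumptions, not the paper's exact construction.

```python
# A minimal sketch, assuming a wrapped bandit algorithm that exposes
# next_arm() and update(arm, reward). Names and parameters are illustrative,
# not the authors' exact algorithm.
import random

class IncentiveCompatibleWrapper:
    """Hide one bandit-driven 'explore' round inside each phase of L rounds."""

    def __init__(self, bandit, phase_len):
        self.bandit = bandit        # arbitrary bandit algorithm (the black box)
        self.phase_len = phase_len  # L: a prior-dependent constant, assumed given
        self._start_phase()

    def _start_phase(self):
        self.round_in_phase = 0
        # the single round of this phase that follows the black-box algorithm
        self.explore_round = random.randrange(self.phase_len)
        self.explore_arm = self.bandit.next_arm()

    def recommend(self, exploit_arm):
        """Recommend an arm; exploit_arm is the posterior-best arm so far."""
        is_explore = (self.round_in_phase == self.explore_round)
        return (self.explore_arm if is_explore else exploit_arm), is_explore

    def observe(self, is_explore, reward):
        # Only the explore round's feedback is passed to the wrapped algorithm,
        # so its regret guarantee transfers up to a multiplicative factor of ~L.
        if is_explore:
            self.bandit.update(self.explore_arm, reward)
        self.round_in_phase += 1
        if self.round_in_phase == self.phase_len:
            self._start_phase()
```

Because an agent observes only the recommended arm and each round is the hidden explore round with probability 1/L, the recommendation coincides with the exploit arm with probability 1 − 1/L; for L large enough relative to the prior, following the recommendation is a Bayesian best response, while the wrapped algorithm's regret is inflated by only a constant factor.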

MSC:

90B50 Management decision making, including multiple objectives
