
Nonstochastic multi-armed bandits with graph-structured feedback. (English) Zbl 1375.68097

Summary: We introduce and study a partial-information model of online learning, where a decision maker repeatedly chooses from a finite set of actions and observes some subset of the associated losses. This setting naturally models several situations where knowing the loss of one action provides information on the loss of other actions. Moreover, it generalizes and interpolates between the well-studied full-information setting (where all losses are revealed) and the bandit setting (where only the loss of the action chosen by the player is revealed). We provide several algorithms addressing different variants of our setting and prove tight regret bounds depending on combinatorial properties of the information feedback structure.
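The feedback structure described in the summary can be represented as a graph over actions: playing action a reveals the losses of all actions in a's neighborhood, so self-loops-only recovers the bandit setting and the complete graph recovers full information. The following is a minimal sketch of the exponential-weights idea with importance-weighted loss estimates adapted to such a graph; it is an illustration of the setting, not the authors' exact algorithm, and the function name `exp3_graph` and its interface are hypothetical.

```python
import math
import random

def exp3_graph(losses, neighbors, eta, rng=random.Random(0)):
    """Play exponential weights on a sequence of loss vectors, where
    choosing action a reveals the losses of all actions in neighbors[a].

    A hedged sketch of the graph-feedback idea, not the paper's exact
    algorithm. `losses` is a list of per-round loss vectors in [0, 1];
    `neighbors[a]` is the set of actions observed when a is played and
    must contain a itself (self-loop), so every estimate is well defined.
    Returns the cumulative loss incurred by the sampled actions.
    """
    K = len(neighbors)
    weights = [1.0] * K
    total_loss = 0.0
    for loss in losses:
        s = sum(weights)
        p = [w / s for w in weights]
        a = rng.choices(range(K), weights=p)[0]  # sample an action from p
        total_loss += loss[a]
        for i in neighbors[a]:
            # Probability that action i's loss is observed this round:
            # the total mass of actions whose neighborhood contains i.
            q_i = sum(p[j] for j in range(K) if i in neighbors[j])
            est = loss[i] / q_i  # importance-weighted loss estimate
            weights[i] *= math.exp(-eta * est)
    return total_loss
```

With `neighbors = [{0, 1, 2}] * 3` the update uses every loss each round (full information), while `neighbors = [{0}, {1}, {2}]` reduces to bandit feedback; intermediate graphs interpolate between the two, which is what the regret bounds in the paper quantify via combinatorial graph parameters.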

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
68W27 Online algorithms; streaming algorithms
68W40 Analysis of algorithms
91A80 Applications of game theory

Software:

AdaBoost.MH

References:

[1] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren, {\it Online learning with feedback graphs: Beyond bandits}, in Proceedings of the 28th Conference on Learning Theory, COLT 2015, Paris, France, 2015, pp. 23-25.
[2] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren, {\it Online Learning with Feedback Graphs: Beyond Bandits}, preprint, 2015.
[3] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour, {\it From bandits to experts: A tale of domination and independence}, in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), Lake Tahoe, NV, 2013, pp. 1610-1618.
[4] N. Alon and J. H. Spencer, {\it The Probabilistic Method}, John Wiley and Sons, New York, 2004. · Zbl 1333.05001
[5] J.-Y. Audibert and S. Bubeck, {\it Minimax policies for adversarial and stochastic bandits}, in Proceedings of COLT 2009, Montreal, Canada, 2009, online proceedings.
[6] J.-Y. Audibert and S. Bubeck, {\it Regret bounds and minimax policies under partial monitoring}, J. Mach. Learn. Res., 11 (2010), pp. 2785-2836. · Zbl 1242.91034
[7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, {\it The nonstochastic multiarmed bandit problem}, SIAM J. Comput., 32 (2002), pp. 48-77, . · Zbl 1029.68087
[8] G. Bartók, D. P. Foster, D. Pál, A. Rakhlin, and C. Szepesvári, {\it Partial monitoring–classification, regret bounds, and algorithms}, Math. Oper. Res., 39 (2014), pp. 967-997. · Zbl 1310.91028
[9] S. Bubeck and N. Cesa-Bianchi, {\it Regret analysis of stochastic and nonstochastic multi-armed bandit problems}, Found. Trends Mach. Learn., 5 (2012), pp. 1-122. · Zbl 1281.91051
[10] S. Buccapatnam, A. Eryilmaz, and N. B. Shroff, {\it Stochastic bandits with side observations on networks}, ACM SIGMETRICS Perform. Eval. Rev., 42 (2014), pp. 289-300.
[11] Y. Caro, {\it New Results on the Independence Number}, Tech. report, Tel Aviv University, Tel Aviv, Israel, 1979.
[12] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat, {\it Leveraging Side Observations in Stochastic Bandits}, preprint, 2012.
[13] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat, {\it Leveraging side observations in stochastic bandits}, in Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, 2012, pp. 142-151.
[14] A. Carpentier and M. Valko, {\it Revealing graph bandits for maximizing local influence}, in Proceedings of the International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 2016, pp. 10-18.
[15] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth, {\it How to use expert advice}, J. ACM, 44 (1997), pp. 427-485. · Zbl 0890.68066
[16] N. Cesa-Bianchi, P. Gaillard, C. Gentile, and S. Gerchinovitz, {\it Algorithmic chaining and the role of partial feedback in online nonparametric learning}, in Proceedings of the 30th Conference on Learning Theory (COLT), JMLR Workshop and Conference Proceedings Volume 65, Amsterdam, The Netherlands, 2017, pp. 465-481.
[17] N. Cesa-Bianchi and G. Lugosi, {\it Prediction, Learning, and Games}, Cambridge University Press, Cambridge, UK, 2006. · Zbl 1114.91001
[18] N. Cesa-Bianchi and G. Lugosi, {\it Combinatorial bandits}, J. Comput. System Sci., 78 (2012), pp. 1404-1422. · Zbl 1262.91052
[19] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz, {\it Improved second-order bounds for prediction with expert advice}, in Proceedings of the 18th Annual Conference on Learning Theory, Bertinoro, Italy, 2005, pp. 217-232. · Zbl 1137.68525
[20] V. Chvátal, {\it A greedy heuristic for the set-covering problem}, Math. Oper. Res., 4 (1979), pp. 233-235. · Zbl 0443.90066
[21] A. Cohen, T. Hazan, and T. Koren, {\it Online Learning with Feedback Graphs without the Graphs}, preprint, 2016.
[22] O. Dekel, A. Tewari, and R. Arora, {\it Online bandit learning against an adaptive adversary: From regret to policy regret}, in Proceedings of ICML 2012, Edinburgh, Scotland, 2012, pp. 1503-1510.
[23] D. Freedman, {\it On tail probabilities for martingales}, Ann. Probab., 3 (1975), pp. 100-118. · Zbl 0313.60037
[24] Y. Freund and R. E. Schapire, {\it A decision-theoretic generalization of on-line learning and an application to boosting}, in Proceedings of the 2nd Annual Conference on Computational Learning Theory (Euro-COLT ’95) (Barcelona, 1995), J. Comput. System Sci., 55 (1997), pp. 119-139. · Zbl 0880.68103
[25] A. Frieze and C. McDiarmid, {\it Algorithmic theory of random graphs}, Random Structures Algorithms, 10 (1997), pp. 5-42. · Zbl 0868.05048
[26] A. M. Frieze, {\it On the independence number of random graphs}, Discrete Math., 81 (1990), pp. 171-175. · Zbl 0712.05052
[27] M. K. Hanawal and V. Saligrama, {\it Cost effective algorithms for spectral bandits}, in Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton 2015), Monticello, IL, 2015, pp. 1323-1329.
[28] A. Kalai and S. Vempala, {\it Efficient algorithms for online decision problems}, J. Comput. System Sci., 71 (2005), pp. 291-307. · Zbl 1094.68112
[29] D. Kempe, J. Kleinberg, and É. Tardos, {\it Maximizing the spread of influence through a social network}, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2003, pp. 137-146. · Zbl 1337.91069
[30] D. Kempe, J. Kleinberg, and É. Tardos, {\it Maximizing the spread of influence through a social network}, Theory Comput., 11 (2015), pp. 105-147. · Zbl 1337.91069
[31] T. Kocák, G. Neu, and M. Valko, {\it Online learning with Erdős-Rényi side-observation graphs}, in Proceedings of the Conference on Uncertainty in Artificial Intelligence, New York, 2016, pp. 339-346.
[32] T. Kocák, G. Neu, and M. Valko, {\it Online learning with noisy side observations}, in Proceedings of the International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 2016, pp. 1186-1194.
[33] T. Kocák, G. Neu, M. Valko, and R. Munos, {\it Efficient learning by implicit exploration in bandit problems with side observations}, in Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, Canada, 2014, pp. 613-621.
[34] J. Langford and T. Zhang, {\it The epoch-greedy algorithm for multi-armed bandits with side information}, in Proceedings of the Conference on Advances in Neural Information Processing Systems, Vancouver, Canada, 2008, pp. 817-824.
[35] N. Littlestone and M. K. Warmuth, {\it The weighted majority algorithm}, Inform. Comput., 108 (1994), pp. 212-261. · Zbl 0804.68121
[36] O. Maillard and R. Munos, {\it Adaptive bandits: Towards the best history-dependent strategy}, in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings Volume 15, Fort Lauderdale, FL, 2011, pp. 570-578.
[37] S. Mannor and O. Shamir, {\it From bandits to experts: On the value of side-observations}, in Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), Granada, Spain, 2011, pp. 684-692.
[38] G. Neu, {\it Explore no more: Improved high-probability regret bounds for non-stochastic bandits}, in Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 3150-3158.
[39] V. Perchet and P. Rigollet, {\it The multi-armed bandit problem with covariates}, Ann. Statist., 41 (2013), pp. 693-721. · Zbl 1360.62436
[40] P. Rusmevichientong and J. Tsitsiklis, {\it Linearly parameterized bandits}, Math. Oper. Res., 35 (2010), pp. 395-411. · Zbl 1217.93190
[41] A. Said, E. W. De Luca, and S. Albayrak, {\it How social relationships affect user similarities}, in Proceedings of the International Conference on Intelligent User Interfaces Workshop on Social Recommender Systems, Hong Kong, 2010, online proceedings.
[42] A. Slivkins, {\it Contextual bandits with similarity information}, in Proceedings of the 24th Annual Conference on Learning Theory (COLT), JMLR Workshop and Conference Proceedings Volume 19, Budapest, Hungary, 2011, pp. 679-702.
[43] G. Stoltz, {\it Information Incomplète et Regret Interne en Prédiction de Suites Individuelles}, Ph.D. thesis, Université Paris-XI Orsay, Orsay, France, 2005.
[44] M. Valko, R. Munos, B. Kveton, and T. Kocák, {\it Spectral bandits for smooth graph functions}, in Proceedings of the International Conference on Machine Learning, Beijing, China, 2014, pp. 46-54. · Zbl 07306905
[45] V. G. Vovk, {\it Aggregating strategies}, in Proceedings of the Conference on Learning Theory (COLT), Rochester, NY, 1990, pp. 371-386.
[46] C. Wang, S. Kulkarni, and H. Poor, {\it Arbitrary side observations in bandit problems}, Adv. Appl. Math., 34 (2005), pp. 903-938. · Zbl 1152.91391
[47] C. Wang, S. Kulkarni, and H. Poor, {\it Bandit problems with side observations}, IEEE Trans. Automat. Control, 50 (2005), pp. 338-355. · Zbl 1366.91063
[48] V. K. Wei, {\it A Lower Bound on the Stability Number of a Simple Graph}, Bell Lab. Tech. Memo 81-11217-9, 1981.
[49] Y. Wu, A. György, and C. Szepesvári, {\it Online learning with Gaussian payoffs and side observations}, in Proceedings of the Conference on Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 1360-1368.
[50] N. Zolghadr, G. Bartók, R. Greiner, A. György, and C. Szepesvári, {\it Online learning with costly features and labels}, in Proceedings of the Conference on Advances in Neural Information Processing Systems, Lake Tahoe, NV, 2013, pp. 1241-1249.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. This list attempts to reflect the references in the original paper as accurately as possible, without any claim of completeness or perfect matching.