The weighted majority algorithm. (English) Zbl 0804.68121

Summary: We study the construction of prediction algorithms in a situation in which a learner faces a sequence of trials, with a prediction to be made in each, and the goal of the learner is to make few mistakes. We are interested in the case where the learner has reason to believe that one of some pool of known algorithms will perform well, but the learner does not know which one. A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm in such a circumstance. We call this method the Weighted Majority Algorithm. We show that this algorithm is robust in the presence of errors in the data. We discuss various versions of the Weighted Majority Algorithm and prove mistake bounds for them that are closely related to the mistake bounds of the best algorithms of the pool. For example, given a sequence of trials, if there is an algorithm in the pool \(\mathcal A\) that makes at most \(m\) mistakes, then the Weighted Majority Algorithm will make at most \(c(\log| {\mathcal A}|+ m)\) mistakes on that sequence, where \(c\) is a fixed constant.
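The weighted-voting scheme summarized above can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the binary prediction encoding, the penalty factor \(\beta = 1/2\), and the tie-breaking rule are assumptions made for concreteness.

```python
def weighted_majority(experts, trials, beta=0.5):
    """Sketch of the Weighted Majority Algorithm.

    experts: list of prediction functions, each mapping a trial input
             to a 0/1 prediction (the "pool" of known algorithms).
    trials:  iterable of (input, correct_label) pairs.
    beta:    multiplicative penalty applied to each wrong expert's
             weight after a trial (beta = 1/2 is an assumed choice).
    Returns the number of mistakes made by the compound predictor.
    """
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, label in trials:
        preds = [e(x) for e in experts]
        # Weighted vote: compare the total weight behind each prediction.
        w1 = sum(w for w, p in zip(weights, preds) if p == 1)
        w0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if w1 >= w0 else 0  # ties broken toward 1 (assumption)
        if guess != label:
            mistakes += 1
        # Penalize every expert that predicted incorrectly.
        weights = [w * beta if p != label else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

Because a mistake by the compound algorithm requires at least half the total weight to be wrong, and that wrong weight is then multiplied by \(\beta\), the total weight shrinks geometrically with each mistake, which is the mechanism behind the \(c(\log|\mathcal A| + m)\) bound.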


68T05 Learning and adaptive systems in artificial intelligence
68Q25 Analysis of algorithms and problem complexity
91B12 Voting theory
Full Text: DOI Link

