zbMATH — the first resource for mathematics

Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. (English) Zbl 1440.62237
Summary: The machine learning community adopted null hypothesis significance testing (NHST) to ensure the statistical validity of results. Many scientific fields, however, have realized the shortcomings of frequentist reasoning and, in the most radical cases, even banned its use in publications. We should do the same: just as we have embraced the Bayesian paradigm in the development of new machine learning methods, so we should also use it in the analysis of our own results. We argue for abandonment of NHST by exposing its fallacies and, more importantly, offer better – more sound and useful – alternatives for it.

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H20 Measures of association (correlation, canonical correlation, etc.)
62F03 Parametric hypothesis testing
62F15 Bayesian inference
Stan; JAGS; BayesDA; PMTK; LePAC
[1] Murray Aitkin. Posterior Bayes factors. Journal of the Royal Statistical Society. Series B (Methodological), pages 111–142, 1991. · Zbl 0800.62167
[2] M.J. Bayarri and James O Berger. Hypothesis Testing and Model Uncertainty. In Bayesian Theory and Applications, pages 365–394. OUP Oxford, 2013.
[3] Alessio Benavoli and Cassio P. Campos. Advanced Methodologies for Bayesian Networks: Second International Workshop, AMBN 2015, Yokohama, Japan, November 16-18, 2015. Proceedings, chapter Statistical tests for joint analysis of performance measures. Springer International Publishing, Cham, 2015.
[4] Alessio Benavoli, Francesca Mangili, Giorgio Corani, Marco Zaffalon, and Fabrizio Ruggeri. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 30th International Conference on Machine Learning (ICML 2014), pages 1–9, 2014.
[5] Alessio Benavoli, Giorgio Corani, Francesca Mangili, and Marco Zaffalon. A Bayesian nonparametric procedure for comparing algorithms. In Proceedings of the 31th International Conference on Machine Learning (ICML 2015), pages 1–9, 2015a. · Zbl 1440.62241
[6] Alessio Benavoli, Francesca Mangili, Fabrizio Ruggeri, and Marco Zaffalon. Imprecise Dirichlet process with application to the hypothesis test on the probability that X≤Y. Journal of Statistical Theory and Practice, 9(3):658–684, 2015b.
[7] Alessio Benavoli, Giorgio Corani, and Francesca Mangili. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17(5):1–10, 2016. · Zbl 1360.62208
[8] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics, New York, 1985. · Zbl 0572.62008
[9] James O. Berger and Luis R Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996. · Zbl 0870.62021
[10] James O. Berger and Thomas Sellke. Testing a point null hypothesis: The irreconcilability of p-values and evidence. Journal of the American Statistical Association, 82(397):112–122, 1987. · Zbl 0612.62022
[11] James O. Berger, E. Moreno, L. R. Pericchi, M. J. Bayarri, Bernardo, et al. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994.
[12] José M Bernardo and Adrian FM Smith. Bayesian theory, volume 405. Wiley Chichester, 2009.
[13] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2007. · Zbl 1107.68072
[14] Remco R Bouckaert. Choosing between two learning algorithms based on calibrated tests. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 51–58, 2003.
[15] Bob Carpenter, Daniel Lee, Marcus A Brubaker, Allen Riddell, Andrew Gelman, Ben Goodrich, Jiqiang Guo, Matt Hoffman, Michael Betancourt, and Peter Li. Stan: A probabilistic programming language. Journal of Statistical Software, in press, 2016.
[16] Giorgio Corani and Alessio Benavoli. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Machine Learning, 100(2):285–304, 2015. doi: 10.1007/s10994-015-5486-z. · Zbl 1341.62088
[17] Giorgio Corani, Alessio Benavoli, Janez Demsar, Francesca Mangili, and Marco Zaffalon. Statistical comparison of classifiers through Bayesian hierarchical modelling. Machine Learning in press, 2017. doi: 10.1007/s10994-017-5641-9. · Zbl 1440.62241
[18] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006. · Zbl 1222.68184
[19] Janez Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, 2008.
[20] James Dickey. Scientific reporting and personal probabilities: Student’s hypothesis. Journal of the Royal Statistical Society. Series B (Methodological), pages 285–305, 1973.
[21] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
[22] Ward Edwards, Harold Lindman, and Leonard J Savage. Bayesian statistical inference for psychological research. Psychological review, 70(3):193, 1963. · Zbl 0173.22004
[23] Andrew Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006. · Zbl 1331.62139
[24] Andrew Gelman, Jennifer Hill, and Masanao Yajima. Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5(2):189–211, 2012.
[25] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC press, 2013. · Zbl 1279.62004
[26] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis, volume 2. Taylor & Francis, 2014. · Zbl 1279.62004
[27] Myles Hollander, Douglas A Wolfe, and Eric Chicken. Nonparametric Statistical Methods, volume 751. John Wiley & Sons, 2013. · Zbl 1279.62006
[28] Miguel A Juárez and Mark FJ Steel. Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics, 28(1):52–66, 2010. · Zbl 1198.62097
[29] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995. · Zbl 0846.62028
[30] John K. Kruschke. Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5):658–676, 2010.
[31] John K. Kruschke. Bayesian estimation supersedes the t-test. Journal of Experimental Psychology: General, 142(2):573, 2013.
[32] John K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R, Jags and Stan. Academic Press, 2015. · Zbl 1300.62001
[33] John K. Kruschke and Torrin M Liddell. The Bayesian New Statistics: Two Historical Trends Converge. Available at SSRN 2606016, 2015.
[34] Alexandre Lacoste, François Laviolette, and Mario Marchand. Bayesian comparison of machine learning algorithms on single and multiple datasets. In Proc. of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), pages 665–675, 2012.
[35] Bruno Lecoutre and Jacques Poitevineau. The Significance Test Controversy Revisited. Springer, 2014. · Zbl 1341.62035
[36] Francesca Mangili, Alessio Benavoli, Cassio P. de Campos, and Marco Zaffalon. Reliable survival analysis based on the Dirichlet Process. Biometrical Journal, 57:1002–1019, 2015. · Zbl 1386.62057
[37] Kevin P Murphy. Machine Learning: a Probabilistic Perspective. MIT press, 2012. · Zbl 1295.68003
[38] Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003. · Zbl 1039.68104
[39] Steven L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.
[40] Zbynek Sidak, Pranab Sen, and Jaroslav Hajek. Theory of Rank Tests. Probability and Mathematical Statistics. Elsevier Science, 1999. · Zbl 0944.62045
[41] David Trafimow and Michael Marks. Editorial. Basic and Applied Social Psychology, 37(1):1–2, 2015.
[42] Peter Walley. Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):3–57, 1996. · Zbl 0834.62004
[43] Ronald L Wasserstein and Nicole A Lazar. The ASA’s statement on p-values: context, process, and purpose. The American Statistician, (just-accepted):00–00, 2016.
[44] Peter H Westfall, S Stanley Young, and S Paul Wright. On adjusting p-values for multiplicity. Biometrics, 49(3):941–945, 1993.
[45] Ian H Witten, Eibe Frank, and Mark Hall. Data Mining: Practical Machine Learning Tools and Techniques (third edition). Morgan Kaufmann, 2011. · Zbl 1076.68555
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.