Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach. (English) Zbl 1231.62030

Summary: In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of each input variable into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid allows to quantify the joint information between the two input variables and the output variable. The best joint partitioning is searched by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement of accuracy for classification tasks.


62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-07 Data analysis (statistics) (MSC2010)
65C60 Computational problems in statistics (MSC2010)


Full Text: DOI


[1] Abramowitz M, Stegun I (1970) Handbook of mathematical functions. Dover, New York · Zbl 0171.38503
[2] Bay S (2001) Multivariate discretization for set mining. Mach Learn 3(4): 491–512 · Zbl 0987.68633
[3] Berger J (2006) The case of objective Bayesian analysis. Bayesian Anal 1(3): 385–402 · Zbl 1331.62042
[4] Bernardo J, Smith A (2000) Bayesian theory. Wiley, New York · Zbl 0943.62009
[5] Bertier P, Bouroche J (1981) Analyse des données multidimensionnelles. Presses Universitaires de France
[6] Blake C, Merz C (1996) UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html
[7] Boullé M (2004) Khiops: a statistical discretization method of continuous attributes. Mach Learn 55(1): 53–69 · Zbl 1067.68121
[8] Boullé M (2005) A Bayes optimal approach for partitioning the values of categorical attributes. J Mach Learn Res 6: 1431–1452 · Zbl 1222.68153
[9] Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1): 131–165 · Zbl 1470.68086
[10] Boullé M (2007) Compression-based averaging of selective naive Bayes classifiers. J Mach Learn Res 8: 1659–1685 · Zbl 1222.62035
[11] Boullé M (2008) Bivariate data grid models for supervised learning. Technical Report NSM/R&D/ TECH/EASY/TSI/4/MB, France Telecom R&D. http://perso.rd.francetelecom.fr/boulle/publications/BoulleNTTSI4MB08.pdf
[12] Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International, California · Zbl 0541.62042
[13] Carr D, Littlefield R, Nicholson W, Littlefield J (1987) Scatterplot matrix techniques for large n. J Am Stat Assoc 82: 424–436
[14] Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R (2000) CRISP-DM 1.0: step-by-step data mining guide
[15] Cochran W (1954) Some methods for strengthening the common chi-squared tests. Biometrics 10(4): 417–451 · Zbl 0059.12803
[16] Connor-Linton J (2003) Chi square tutorial. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html
[17] Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8: 87–102 · Zbl 0767.68084
[18] Goldstein M (2006) Subjective Bayesian analysis: principles and practice. Bayesian Anal 1(3): 403–420 · Zbl 1331.62047
[19] Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3: 1157–1182 · Zbl 1102.68556
[20] Guyon I, Gunn S, Hur AB, Dror G (2006) Design and analysis of the NIPS2003 challenge. In: Guyon I, Gunn S, Nikravesh M, Zadeh L (eds) Feature extraction: foundations and applications, chap 9. Springer, New York, pp 237–263
[21] Hansen P, Mladenovic N (2001) Variable neighborhood search: principles and applications. Eur J Oper Res 130: 449–467 · Zbl 0981.90063
[22] Holte R (1993) Very simple classification rules perform well on most commonly used datasets. Mach Learn 11: 63–90 · Zbl 0850.68278
[23] Kass G (1980) An exploratory technique for investigating large quantities of categorical data. Appl Stat 29(2): 119–127
[24] Kerber R (1992) ChiMerge discretization of numeric attributes. In: Proceedings of the 10th international conference on artificial intelligence. MIT Press, Cambridge, pp 123–128
[25] Kohavi R, John G (1997) Wrappers for feature selection. Artif Intell 97(1-2): 273–324 · Zbl 0904.68143
[26] Kohavi R, Sahami M (1996) Error-based and entropy-based discretization of continuous features. In: Proceedings of the 2nd international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 114–119
[27] Kononenko I, Bratko I, Roskar E (1984) Experiments in automatic learning of medical diagnostic rules. Technical report, Joseph Stefan Institute, Faculty of Electrical Engineering and Computer Science, Ljubljana
[28] Kurgan L, Cios J (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2): 145–153 · Zbl 05108900
[29] Kwedlo W, Kretowski M (1999) An evolutionary algorithm using multivariate discretization for decision rule induction. In: Principles of data mining and knowledge discovery. Lecture notes in computer science, vol 1704. Springer, Berlin, 392–397
[30] Langley P, Iba W, Thompson K (1992) An analysis of Bayesian classifiers. In: 10th National conference on artificial intelligence. AAAI Press, San Jose, pp 223–228
[31] Maass W (1994) Efficient agnostic pac-learning with simple hypothesis. In: COLT ’94: Proceedings of the seventh annual conference on Computational learning theory. ACM Press, New York, pp 67–75
[32] Nadif M, Govaert G (2005) Block clustering of contingency table and mixture model. In: Advances in intelligent data analysis VI. Lecture notes in computer science, vol 3646. Springer, Berlin, pp 249–259 · Zbl 1165.68418
[33] Olszak M, Ritschard G (1995) The behaviour of nominal and ordinal partial association measures. The Statistician 44(2): 195–212
[34] Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco
[35] Quinlan J (1986) Induction of decision trees. Mach Learn 1: 81–106
[36] Quinlan J (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Francisco
[37] Rissanen J (1978) Modeling by shortest data description. Automatica 14: 465–471 · Zbl 0418.93079
[38] Ritschard G, Nicoloyannis N (2000) Aggregation and association in cross tables. In: PKDD ’00: proceedings of the 4th European conference on principles of data mining and knowledge discovery. Springer, Berlin, pp 593–598
[39] Robert C (1997) The Bayesian choice: a decision-theoretic motivation. Springer, New York
[40] Saporta G (1990) Probabilités analyse des données et statistique. TECHNIP, Paris · Zbl 0703.62003
[41] Shannon C (1948) A mathematical theory of communication. Technical Report 27, Bell systems technical journal · Zbl 1154.94303
[42] Steck H, Jaakkola T (2004) Predictive discretization during model selection. Pattern Recognit LNCS 3175: 1–8
[43] Weaver W, Shannon C (1949) The mathematical theory of communication. University of Illinois Press, Urbana · Zbl 0041.25804
[44] Zighed D, Rakotomalala R (2000) Graphes d’induction. Hermes, France
[45] Zighed D, Rabaseda S, Rakotomalala R (1998) Fusinter: a method for discretization of continuous attributes for supervised learning. Int J Uncertain Fuzziness Knowl Based Syst 6(33): 307–326 · Zbl 1087.68629
[46] Zighed D, Ritschard G, Erray W, Scuturici V (2005) Decision trees with optimal joint partitioning. Int J Intell Syst 20(7): 693–718 · Zbl 1101.68535
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.