zbMATH — the first resource for mathematics

Bayesian classifiers based on kernel density estimation: flexible classifiers. (English) Zbl 1191.68600
Summary: When learning Bayesian network based classifiers, continuous variables are usually handled by discretization or are assumed to follow a Gaussian distribution. This work introduces the kernel based Bayesian network paradigm for supervised classification: a Bayesian network that estimates the true density of the continuous variables using kernels. In addition, tree-augmented naive Bayes, the \(k\)-dependence Bayesian classifier and the complete graph classifier are adapted to the novel kernel based Bayesian network paradigm. Moreover, the strong consistency of the presented classifiers is proved, and a kernel based estimator of the mutual information is presented. The classifiers presented in this work can be seen as the natural extension of the flexible naive Bayes classifier proposed by John and Langley [G. H. John and P. Langley, in: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 338–345 (1995)], breaking with its strong independence assumption.
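A kernel based mutual information estimator of the kind referred to above (cf. Moon, Rajagopalan and Lall in the reference list) is commonly realized as a plug-in estimate \(\hat I(X;Y) = \frac{1}{n}\sum_{i} \log \frac{\hat f_{XY}(x_i,y_i)}{\hat f_X(x_i)\,\hat f_Y(y_i)}\), with the densities replaced by kernel estimates. The sketch below is purely illustrative, not the authors' construction: the Gaussian product kernel, the single fixed bandwidth `h`, and the function names are all assumptions.

```python
import numpy as np

def kde(points, query, h):
    """Gaussian product-kernel density estimate.
    points, query: arrays of shape (n, d) and (m, d)."""
    n, d = points.shape
    diffs = (query[:, None, :] - points[None, :, :]) / h   # (m, n, d)
    k = np.exp(-0.5 * (diffs ** 2).sum(axis=2))            # (m, n)
    return k.sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** d)

def mutual_information(x, y, h=0.3):
    """Plug-in kernel estimate of I(X;Y) for 1-D continuous samples x, y."""
    xy = np.column_stack([x, y])
    f_xy = kde(xy, xy, h)                  # joint density at the sample points
    f_x = kde(x[:, None], x[:, None], h)   # marginal densities
    f_y = kde(y[:, None], y[:, None], h)
    return np.mean(np.log(f_xy / (f_x * f_y)))
```

Such plug-in estimates are sensitive to the bandwidth choice; in practice a data-driven selector would replace the fixed `h` assumed here.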
Among the flexible classifiers, flexible tree-augmented naive Bayes appears to behave best for supervised classification. Moreover, the flexible classifiers presented here obtain competitive errors compared with state-of-the-art classifiers.
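For concreteness, the flexible naive Bayes idea that these classifiers extend can be sketched as follows: each class-conditional density of each continuous variable is estimated with a one-dimensional Gaussian kernel density estimate, and prediction maximizes the resulting log-posterior. Everything in this sketch (the class name, the NumPy-only design, Silverman's rule-of-thumb bandwidth) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def gaussian_kde_1d(samples, h):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    def f(x):
        z = (x - samples[:, None]) / h                     # (n_samples, n_query)
        return np.exp(-0.5 * z ** 2).sum(axis=0) / (len(samples) * h * np.sqrt(2 * np.pi))
    return f

class FlexibleNaiveBayes:
    """Naive Bayes whose class-conditional densities are kernel estimates."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.kdes_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(y)
            # Silverman's normal-reference bandwidth, per feature (an assumed choice)
            hs = [1.06 * Xc[:, j].std(ddof=1) * len(Xc) ** (-1 / 5)
                  for j in range(X.shape[1])]
            self.kdes_[c] = [gaussian_kde_1d(Xc[:, j], hs[j])
                             for j in range(X.shape[1])]
        return self

    def predict(self, X):
        # Log-posterior up to a constant: log prior + sum of log kernel densities
        scores = np.column_stack([
            np.log(self.priors_[c]) +
            sum(np.log(kde(X[:, j]) + 1e-300) for j, kde in enumerate(self.kdes_[c]))
            for c in self.classes_])
        return self.classes_[scores.argmax(axis=1)]
```

The classifiers of the paper relax the independence assumption built into this sketch by allowing edges among the predictor variables (tree-augmented, \(k\)-dependence and complete-graph structures).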

68T10 Pattern recognition, speech recognition
68T05 Learning and adaptive systems in artificial intelligence
C4.5; KernSmooth; UCI-ml
Full Text: DOI
[1] M. Aladjem. Projection pursuit fitting Gaussian mixture models, in: Proceedings of Joint IAPR, volume 2396 of Lecture Notes in Computer Science (2002) 396-404. · Zbl 1073.68726
[2] Aladjem, M., Projection pursuit mixture density estimation, IEEE transactions on signal processing, 53, 11, 4376-4383, (2005) · Zbl 1370.94062
[3] J. Bilmes, A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture models, Technical Report ICSI-TR-97-021, University of California, Berkeley, 1997.
[4] Bishop, C.M., Neural networks for pattern recognition, (1995), Oxford University Press
[5] Bishop, C.M., Latent variable models, Learning in graphical models, 371-403, (1999) · Zbl 0948.62043
[6] Bishop, C.M., Pattern recognition and machine learning, Information science and statistics, (2006), Springer
[7] S.G. Bøttcher, Learning Bayesian Networks with Mixed Variables, PhD thesis, Aalborg University, 2004.
[8] R. Bouckaert, Naive Bayes classifiers that perform well with continuous variables, In: Proceedings of the Seventeenth Australian Conference on Artificial Intelligence, 2004, pp. 1089-1094.
[9] G. Casella and R.L. Berger, Statistical Inference, Wadsworth and Brooks, 1990.
[10] Castillo, E.; Gutierrez, J.M.; Hadi, A.S., Expert systems and probabilistic network models, (1997), Springer-Verlag
[11] J. Cheng and R. Greiner. Comparing Bayesian network classifiers, in: Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, 1999, pp. 101-107.
[12] Chickering, D.M., Learning equivalence classes of Bayesian-network structures, Journal of machine learning research, 2, 445-498, (2002) · Zbl 1007.68179
[13] Chow, C.; Liu, C., Approximating discrete probability distributions with dependence trees, IEEE transactions on information theory, 14, 462-467, (1968) · Zbl 0165.22305
[14] Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C., Introduction to algorithms, (2003), MIT Press
[15] Cover, T.M.; Thomas, J.A., Elements of information theory, (1991), John Wiley and Sons · Zbl 0762.94001
[16] Cover, T.M.; Hart, P.E., Nearest neighbour pattern classification, IEEE transactions on information theory, 13, 21-27, (1967) · Zbl 0154.44505
[17] DeGroot, M., Optimal statistical decisions, (1970), McGraw-Hill New York · Zbl 0225.62006
[18] Delaigle, A.; Gijbels, I., Comparison of data-driven bandwidth selection procedures in deconvolution kernel density estimation, Computational statistics and data analysis, 39, 1-20, (2002)
[19] Demšar, J., Statistical comparisons of classifiers over multiple data sets, Journal of machine learning research, 7, 1-30, (2006) · Zbl 1222.68184
[20] Devroye, L., The equivalence in \(L_1\) of weak, strong and complete convergence of kernel density estimates, Annals of statistics, 11, 896-904, (1983)
[21] Diamantidis, N.A.; Karlis, D.; Giakoumakis, E.A., Unsupervised stratification of cross-validation for accuracy estimation, Artificial intelligence, 116, 1-16, (2000) · Zbl 0939.68744
[22] P. Domingos, A unified bias-variance decomposition and its applications, in: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, 2000, pp. 231-238.
[23] Domingos, P.; Pazzani, M., On the optimality of the simple Bayesian classifier under zero-one loss, Machine learning, 29, 103-130, (1997) · Zbl 0892.68076
[24] J. Dougherty, R. Kohavi, and M. Sahami, Supervised and unsupervised discretization of continuous features. in: Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 194-202.
[25] Duda, R.; Hart, P., Pattern classification and scene analysis, (1973), John Wiley and Sons · Zbl 0277.68056
[26] Duda, R.; Hart, P.; Stork, D., Pattern classification, (2000), John Wiley and Sons
[27] U. Fayyad and K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Proceedings of the 13th International Conference on Artificial Intelligence, 1993, pp. 1022-1027.
[28] M.A.T. Figueiredo, J.M.N. Leitão, and A.K. Jain, On fitting mixture models. In Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 1654 of Lecture Notes in Computer Science, 1999, pp. 732-749.
[29] Friedman, J.H., On bias, variance, 0/1 - loss, and the curse-of-dimensionality, Data mining and knowledge discovery, 1, 55-77, (1997)
[30] Friedman, N.; Geiger, D.; Goldszmidt, M., Bayesian network classifiers, Machine learning, 29, 131-163, (1997) · Zbl 0892.68077
[31] Fukunaga, K., Introduction to statistical pattern recognition, (1972), Academic Press Inc.
[32] D. Geiger and D. Heckerman, Learning Gaussian networks. Technical report, Microsoft Research, Advanced Technology Division, 1994.
[33] Geman, S.; Bienenstock, E.; Doursat, R., Neural networks and the bias-variance dilemma, Neural computation, 4, 1-58, (1992)
[34] Goldberg, D.E., Genetic algorithms in search, optimization and machine learning, (1989), Addison-Wesley · Zbl 0721.68056
[35] Greiner, R.; Zhou, W.; Su, X.; Shen, B., Structural extension to logistic regression: discriminative parameter learning of belief net classifiers, Machine learning, 59, 3, 297-322, (2005) · Zbl 1101.68759
[36] Y. Gurwicz and B. Lerner, Rapid spline-based kernel density estimation for Bayesian networks, in: Proceedings of the 17th International Conference on Pattern Recognition, Vol. 3, 2004, pp. 700-703.
[37] Gurwicz, Y.; Lerner, B., Bayesian network classification using spline-approximated kernel density estimation, Pattern recognition letters, 26, 1761-1771, (2005)
[38] Hand, D.J.; Yu, K., Idiot’s Bayes - not so stupid after all?, International statistical review, 69, 3, 385-398, (2001) · Zbl 1213.62010
[39] James, G.M., Variance and bias for general loss functions, Machine learning, 51, 115-135, (2003) · Zbl 1027.68067
[40] Jebara, T., Machine learning: discriminative and generative, (2004), Kluwer Academic Publishers. · Zbl 1030.68073
[41] G.H. John and P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338-345.
[42] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: International Joint Conference on Artificial Intelligence, Vol. 14, 1995, pp. 1137-1145.
[43] R. Kohavi, Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, Computer Science Department, 1995.
[44] R. Kohavi, B. Becker, and D. Sommerfield, Improving simple Bayes. Technical report, Data Mining and Visualization Group, Silicon Graphics, 1997.
[45] Kohavi, R.; John, G., Wrappers for feature subset selection, Artificial intelligence, 97, 1-2, 273-324, (1997) · Zbl 0904.68143
[46] Kohavi, R.; Wolpert, D.H., Bias plus variance decomposition for zero-one loss functions, in: Proceedings of the 13th International Conference on Machine Learning, (1996), 275-283
[47] I. Kononenko. Semi-naive Bayesian classifiers, in: Proceedings of the 6th European Working Session on Learning, 1991, pp. 206-219.
[48] P. Langley, W. Iba, and K. Thompson, An analysis of Bayesian classifiers, in: Proceedings of the 10th National Conference on Artificial Intelligence, 1992, pp. 223-228.
[49] Larrañaga, P.; Lozano, J.A., Estimation of distribution algorithms, A new tool for evolutionary computation, (2002), Kluwer Academic Publishers. · Zbl 0979.00024
[50] Lauritzen, S.L., Graphical models, (1996), Oxford University Press · Zbl 0907.62001
[51] S.L. Lauritzen and N. Wermuth. Mixed interaction models, Technical report r 84-8, Institute for Electronic Systems, Aalborg University, 1984. · Zbl 0669.62045
[52] Lauritzen, S.L.; Wermuth, N., Graphical models for associations between variables, some of which are qualitative and some quantitative, Annals of statistics, 17, (1989) · Zbl 0669.62045
[53] Lerner, B., Bayesian fluorescence in situ hybridisation signal classification, Artificial intelligence in medicine, 30, 3, 301-316, (2004)
[54] Lerner, B.; Lawrence, N.D., A comparison of state-of-the-art classification techniques with application to cytogenetics, Neural computing and applications, 10, 1, 39-47, (2001) · Zbl 1157.68506
[55] McLachlan, G.J.; Peel, D., Finite mixture models, Probability and mathematical statistics, (2000), John Wiley and Sons
[56] Minsky, M., Steps toward artificial intelligence, Proceedings of the institute of radio engineers, 49, 8-30, (1961)
[57] Moon, Y.; Rajagopalan, B.; Lall, U., Estimation of mutual information using kernel density estimators, Physical review E, 52, 3, 2318-2321, (1995)
[58] S. Moral, R. Rumí, and A. Salmerón, Estimating mixtures of truncated exponentials from data, in: First European Workshop on Probabilistic Graphical Models, 2002, pp. 156-167.
[59] P.M. Murphy and D.W. Aha, UCI repository of machine learning databases, Technical report, University of California at Irvine, http://www.ics.uci.edu/~mlearn, 1995.
[60] Neapolitan, R., Learning Bayesian networks, (2003), Prentice Hall
[61] A. Pérez, P. Larrañaga, and I. Inza, Information theory and classification error in probabilistic classifiers, in: Proceedings of the Ninth International Conference on Discovery Science. Lecture Notes in Artificial Intelligence, Vol. 4265, 2006, pp. 347-351.
[62] Pérez, A.; Larrañaga, P.; Inza, I., Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes, International journal of approximate reasoning, 43, 1-25, (2006) · Zbl 1097.62057
[63] Parzen, E., On estimation of a probability density function and mode, Annals of mathematical statistics, 33, 3, 1065-1076, (1962) · Zbl 0116.11302
[64] M. Pazzani, Searching for dependencies in Bayesian classifiers, in: Learning from Data: Artificial Intelligence and Statistics V, 1997, pp. 239-248.
[65] Pearl, J., Probabilistic reasoning in intelligent systems: networks of plausible inference, (1988), Morgan Kaufmann Publishers
[66] Quinlan, J.R., Induction of decision trees, Machine learning, 1, 81-106, (1986)
[67] Quinlan, J.R., C4.5: programs for machine learning, (1993), Morgan Kaufmann
[68] Raudys, S., On the effectiveness of Parzen window classifier, Informatica, 2, 3, 434-453, (1991) · Zbl 0904.68154
[69] Romero, V.; Rumí, R.; Salmerón, A., Learning hybrid Bayesian networks using mixtures of truncated exponentials, International journal of approximate reasoning, 42, 54-68, (2006) · Zbl 1096.68707
[70] Roos, T.; Wettig, H.; Grünwald, P.; Myllymäki, P.; Tirri, H., On discriminative Bayesian network classifiers and logistic regression, Machine learning, 59, 3, 267-296, (2005) · Zbl 1101.68785
[71] Rosenblatt, M., Remarks on some nonparametric estimates of a density function, Annals of mathematical statistics, 27, 832-837, (1956) · Zbl 0073.14602
[72] Rosenblatt, F., Principles of neurodynamics, (1959), Spartan Books · Zbl 0143.43504
[73] M. Sahami, Learning limited dependence Bayesian classifiers, in: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 335-338.
[74] G. Santafé, J.A. Lozano, and P. Larrañaga, Discriminative learning of Bayesian network classifiers via the TM algorithm, in: Proceedings of the Eighth European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 2005, pp. 148-160.
[75] Scott, D.W., Multivariate density estimation: theory, practice and visualization, (1992), John Wiley and Sons · Zbl 0850.62006
[76] Scott, D.W.; Szewczyk, W.F., From kernels to mixtures, Technometrics, 43, 3, 323-335, (2001)
[77] Silverman, B.W., Density estimation for statistics and data analysis, (1986), Chapman and Hall London · Zbl 0617.62042
[78] Simonoff, J.S., Smoothing methods in statistics, (1996), Springer · Zbl 0859.62035
[79] van der Putten, P.; van Someren, M., A bias-variance analysis of a real world learning problem: the CoIL Challenge 2000, Machine learning, 57, 177-195, (2004) · Zbl 1078.68738
[80] Wand, M.P.; Jones, M.C., Kernel smoothing, Monographs on statistics and applied probability, (1995), Chapman and Hall · Zbl 0854.62043
[81] Witten, I.H.; Frank, E., Data mining: practical machine learning tools and techniques, (2005), Morgan Kaufmann · Zbl 1076.68555
[82] Y. Yang and G.I. Webb, Discretization for naive-Bayes learning: Managing discretization bias and variance, Technical report 2003-131, School of Computer Science and Software Engineering, Monash University, 2003.
[83] A. Zhou, Z. Cai, and L. Wei, M-kernel merging: Towards density estimation over data streams, in: Proceedings of the Database Systems for Advanced Applications, 2003, pp. 285-292.