
Evaluation and optimization of frequent, closed and maximal association rule based classification. (English) Zbl 1322.62039

Summary: Real-world applications of association rule mining have the well-known problem of discovering a large number of rules, many of which are not interesting or useful for the application at hand. Algorithms for closed and maximal itemset mining significantly reduce the volume of rules discovered and the complexity associated with the task, but the implications of their use, and the important differences with respect to generalization power, precision and recall when used in the classification problem, have not been examined. In this paper, we present a systematic evaluation of the association rules discovered from frequent, closed and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed using a number of real-world datasets that represent diverse characteristics of data/items, and a detailed evaluation of rule sets is provided as a whole and w.r.t. individual classes. Empirical results confirm that, with a proper combination of data mining and statistical analysis, a large number of nonsignificant, redundant and contradictory rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal the important characteristics of, and differences between, using frequent, closed and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures for optimizing such rule sets. With closed itemset mining already being a preferred choice for complexity and redundancy reduction during rule generation, this study further confirms that closed itemset based association rules are also of better quality overall in terms of classification precision and recall, both on the rule set as a whole and on individual class examples.
On the other hand, maximal itemset based association rules, which are a subset of closed itemset based rules, prove insufficient in this regard and typically have worse recall and generalization power. Empirical results also show the pitfall of applying the confidence measure at the start of association rule generation, as is typically done within the association rule framework. Removing rules that fall below a certain confidence threshold also removes the knowledge that contradictions to the relatively higher-confidence rules exist in the data; precision can therefore be increased by discarding contradictory rules prior to applying the confidence constraint.
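The containment relation between the three itemset families the summary compares (maximal ⊆ closed ⊆ frequent) can be illustrated with a minimal sketch. The toy transactions and the support threshold below are hypothetical, not taken from the paper: an itemset is *closed* if no proper superset has the same support, and *maximal* if no proper superset is frequent at all.

```python
from itertools import combinations

# Hypothetical toy transaction database over items a-d.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c", "d"},
]
min_support = 2  # absolute support threshold (an assumption for this sketch)

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Brute-force enumeration of all frequent itemsets (fine for a toy example;
# real miners such as Apriori, CHARM or MAFIA prune this search space).
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(set(c)) >= min_support
}

# Closed: no proper frequent superset has the same support.
closed = {
    s for s in frequent
    if not any(s < t and support(t) == support(s) for t in frequent)
}

# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

assert maximal <= closed <= frequent
```

On this toy data the seven frequent itemsets reduce to four closed ones and a single maximal one, mirroring the reduction in rule volume the summary describes. Note that the supports of all frequent itemsets (and hence the confidence of derived rules) remain recoverable from the closed sets alone, but not from the maximal ones, which is consistent with the summary's finding that maximal itemset based rules lose recall and generalization power.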

MSC:

62-07 Data analysis (statistics) (MSC2010)

Software:

UCI-ml; CMAR

References:

[1] Agrawal, R.; Srikant, R., Fast algorithms for mining association rules in large databases, Santiago, Chile
[2] Agrawal, R.; Imieliski, T.; Swami, A., Mining association rules between sets of items in large databases, Washington, DC, May 16-18
[3] Agresti, A.: An Introduction to Categorical Data Analysis, 2nd edn. Wiley, New York (2007) · Zbl 1266.62008 · doi:10.1002/0470114754
[4] Aydın, T.; Güvenir, H. A., Modeling interestingness of streaming association rules as a benefit-maximizing classification problem, No. 22, 85-99 (2009), Amsterdam
[5] Bay, S.D., Pazzani, M.J.: Detecting group differences: mining contrast sets. Data Min. Knowl. Discov. 5, 213-246 (2001) · Zbl 0982.68048 · doi:10.1023/A:1011429418057
[6] Bayardo, R. J., Efficiently mining long patterns from databases, 85-93 (1998)
[7] Bayardo, R.J., Agrawal, R., Gunopulos, D.: Constraint-based rule mining in large, dense databases. Data Min. Knowl. Discov. 4, 217-240 (2000) · doi:10.1023/A:1009895914772
[8] Blanchard, J.; Guillet, F.; Gras, R.; Briand, H., Using information-theoretic measures to assess association rule interestingness, Houston, Texas, USA
[9] Brijs, T., Vanhoof, K., Wets, G.: Defining interestingness for association rules. Int. J. Inf. Theories Appl. 10(4), 370-376 (2003)
[10] Cheng, H.; Yan, X.; Han, J.; Hsu, C.-W., Discriminative frequent pattern analysis for effective classification, 716-725 (2007) · doi:10.1109/ICDE.2007.367917
[11] Cheng, H.; Yan, X.; Han, J.; Yu, P. S., Direct discriminative pattern mining for effective classification, 169-178 (2008) · doi:10.1109/ICDE.2008.4497425
[12] Frank, A., Asuncion, A.: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. University of California, School of Information and Computer Science, Irvine, CA (2010)
[13] Garriga, G.C., Kralj, P., Lavrac, N.: Closed sets for labeled data. J. Mach. Learn. Res. 9, 559-580 (2008) · Zbl 1225.68179
[14] Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM Comput. Surv. 38(3), 9 (2006) · doi:10.1145/1132960.1132963
[15] Goodman, A., Kamath, C., Kumar, V.: Data analysis in the 21st century. Stat. Anal. Data Min. 1(1), 1-3 (2008) · Zbl 07260177 · doi:10.1002/sam.10000
[16] Gouda, K.; Zaki, M. J., Efficiently mining maximal frequent itemsets, 163-170 (2001)
[17] Hadzic, F.; Dillon, T. S., Using the symmetrical tau (τ) criterion for feature selection in decision tree and neural network learning (2006)
[18] Hämäläinen, W.; Nykänen, M., Efficient discovery of statistically significant association rules, 203-212 (2008)
[19] Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55-86 (2007) · doi:10.1007/s10618-006-0059-1
[20] Hosmer, D.W., Lemeshow, S.: Applied Logistic Regression. Wiley, New York (1989) · Zbl 0967.62045
[21] Lallich, S.; Teytaud, O.; Prudhomme, E.; Guillet, F. J. (ed.); Hamilton, H. J. (ed.), Association rule interestingness: measure and statistical validation, 251-275 (2007), Berlin · doi:10.1007/978-3-540-44918-8_11
[22] Lavrac, N., Flach, P., Zupan, B.: Rule evaluation measures: a unifying view. Inductive Log. Program. 174-185 (1999)
[23] Le Bras, Y.; Lenca, P.; Lallich, S., Mining classification rules without support: an anti-monotone property of Jaccard measure, No. 6926, 179-193 (2011), Berlin · doi:10.1007/978-3-642-24477-3_16
[24] Le Bras, Y.; Lenca, P.; Lallich, S.; Holmes, D. E. (ed.); Jain, L. C. (ed.), Formal framework for the study of algorithmic properties of objective interestingness measures, No. 24, 77-98 (2012) · Zbl 1231.68210
[25] Lenca, P., Meyer, P., Vaillant, B., Lallich, S.: On selecting interestingness measures for association rules: user oriented description and multiple criteria decision aid. Eur. J. Oper. Res. 184, 610-626 (2008) · Zbl 1168.90513 · doi:10.1016/j.ejor.2006.10.059
[26] Li, J.: On optimal rule discovery. IEEE Trans. Knowl. Data Eng. 18(4), 460-471 (2006)
[27] Li, W.; Han, J.; Pei, J., CMAR: accurate and efficient classification based on multiple class-association rules, 369-376 (2001)
[28] Li, J., Shen, H., Topor, R.W.: Mining the optimal class association rule set. Knowl.-Based Syst. 15, 399-405 (2002) · doi:10.1016/S0950-7051(02)00024-2
[29] Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, New York (2002) · Zbl 1011.62004
[30] Liu, B.; Hsu, W.; Ma, Y., Integrating classification and association rule mining, 80-86 (1998)
[31] Liu, B.; Ma, Y.; Wong, C.; Zighed, D. (ed.); Komorowski, J. (ed.); Zytkow, J. (ed.), Improving an association rule based classifier, 504-509 (2000) · doi:10.1007/3-540-45372-5_58
[32] McGarry, K.: A survey of interestingness measures for knowledge discovery. Knowl. Eng. Rev. 20, 39-61 (2005) · doi:10.1017/S0269888905000408
[33] Megiddo, N.; Srikant, R., Discovering predictive association rules, 274-278 (1998)
[34] Novak, P.K., Lavrac, N., Webb, G.I.: Supervised descriptive rule discovery: a unifying survey of contrast set, emerging patterns and subgroup mining. J. Mach. Learn. Res. 10, 377-403 (2009) · Zbl 1235.68178
[35] Piatetsky-Shapiro, G.: Discovery, analysis and presentation of strong rules. Knowl. Discov. Database 229-248 (1991)
[36] Refaat, M.: Data Preparation for Data Mining Using SAS. Morgan Kaufmann, San Francisco (2007)
[37] Shaharanee, I.N.M., Hadzic, F., Dillon, T.S.: Interestingness measures for association rules based on statistical validity. Knowl.-Based Syst. 24, 386-392 (2011) · doi:10.1016/j.knosys.2010.11.005
[38] Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: generalizing association rules to dependence rules. Data Min. Knowl. Discov. 2, 39-68 (1998) · doi:10.1023/A:1009713703947
[39] Simon, G. J.; Kumar, V.; Li, P. W., A simple statistical model and association rule filtering for classification, 823-831 (2011) · doi:10.1145/2020408.2020550
[40] Tan, P. N.; Kumar, V.; Srivastava, J., Selecting the right interestingness measure for association patterns, 32-41 (2002)
[41] Veloso, A.; Meira, W.; Zaki, M. J., Lazy associative classification, 645-654 (2006)
[42] Wang, K.; He, Y.; Cheung, D. W., Mining confident rules without support requirement, 89-96 (2001)
[43] Webb, G.I.: Discovering significant patterns. Mach. Learn. 1-33 (2007) · Zbl 1470.68195
[44] Wei, J.-M., Yi, W.-G., Wang, M.-Y.: Novel measurement for mining effective association rules. Knowl.-Based Syst. 19, 739-743 (2006) · doi:10.1016/j.knosys.2006.05.011
[45] Yin, X.; Han, J., CPAR: classification based on predictive association rules, 369-376 (2003)
[46] Zaki, M.J.: Mining non-redundant association rules. Data Min. Knowl. Discov. 9(3), 223-248 (2004) · doi:10.1023/B:DAMI.0000040429.96086.c7
[47] Zaki, M. J.; Hsiao, C. J., CHARM: an efficient algorithm for closed itemset mining (2002)
[48] Zhang, C.; Zhang, S., Collecting quality data for database mining, 131-142 (2001) · Zbl 1052.68598
[49] Zhou, X.J., Dillon, T.S.: A statistical-heuristic feature selection criterion for decision tree induction. IEEE Trans. Pattern Anal. Mach. Intell. 13, 834-841 (1991) · doi:10.1109/34.85676
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.