×

Disclosed: an efficient depth-first, top-down algorithm for mining disjunctive closed itemsets in high-dimensional data. (English) Zbl 1355.68235

Summary: We focus, in this paper, on the computational challenges of identifying disjunctive Boolean patterns in high-dimensional data. We conduct our analysis focusing particularly in microarray gene expression data, since this is one of the most stereotypical examples of high-dimensional data. We devised a novel algorithm that takes advantage of the scarcity of samples in microarray data sets, allowing us to efficiently find disjunctive closed patterns. Our algorithm, Disclosed, mines disjunctive closed itemsets by exploring the search space in a depth-first, top-down manner.
We evaluated the performance of our algorithm to execute such a task using real microarray gene expression data sets publicly available on the Internet. Our experiments revealed under what situations, the characteristics of a data set, our method obtain a good, bad or average performance. We also compared the performance of our method with the state of the art algorithms for finding disjunctive closed patterns and disjunctive minimal generators. We observed that our approach is two orders of magnitude more efficient, both in terms of time and memory.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition
PDF BibTeX XML Cite
Full Text: DOI

References:

[1] Agrawal, R.; Srikant, R., Fast algorithms for mining association rules in large databases, (VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, (1994), Morgan Kaufmann Publishers Inc. San Francisco, CA, USA), 487-499
[2] Alizadeh, A. A.; Eisen, M. B.; Davis, R. E.; Ma, C.; Lossos, I. S.; Rosenwald, A.; Boldrick, J. C.; Sabet, H.; Tran, T.; Yu, X.; Powell, J. I.; Yang, L.; Marti, G. E.; Moore, T.; Hudson; Jr, J.; Lu, L.; Lewis, D. B.; Tibshirani, R.; Sherlock, G.; Chan, W. C.; Greiner, T. C.; Weisenburger, D. D.; Armitage, J. O.; Warnke, R.; Levy, R.; Wilson, W.; Grever, M. R.; Byrd, J. C.; Botstein, D.; Brown, P. O.; Staudt, L. M., Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling, Nature, 403, 503-511, (2000)
[3] Alon, U.; Barkai, N.; Notterman, D. A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A. J., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci., 96, 6745-6750, (1999)
[4] R. Alves, D.S. Rodriguez-Baena, J.S. Aguilar-Ruiz, Gene association analysis: a survey of frequent pattern mining from gene expression data, Briefings in Bioinformatics, 2009, bbp042, <http://arxiv.org/abs/http://bib.oxfordjournals.org/cgi/reprint/bbp042v
[5] Brunet, J. P.; Tamayo, P.; Golub, T. R.; Mesirov, J. P., Metagenes and molecular pattern discovery using matrix factorization, Proc. Natl. Acad. Sci., 101, 4164-4169, (2004)
[6] Chandran, U.; Ma, C.; Dhir, R.; Bisceglia, M.; Lyons-Weiler, M.; Liang, W.; Michalopoulos, G.; Becich, M.; Monzon, F., Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process, BMC Cancer, 7, 64, (2007)
[7] Chen, Q.; Chen, Y. P., Mining frequent patterns for AMP-activated protein kinase regulation on skeletal muscle, BMC Bioinform., 7, 394, (2006)
[8] G. Cong, K.L. Tan, A.K.H. Tung, F. Pan, Mining frequent closed patterns in microarray data, in: ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining, 2004, pp. 363-366. http://dx.doi.org/10.1109/ICDM.2004.10070.
[9] Cong, G.; Tung, A. K.H.; Xu, X.; Pan, F.; Yang, J., FARMER: finding interesting rule groups in microarray datasets, (Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, (2004), ACM New York, NY, USA), 143-154, http://dx.doi.org/10.1145/1007568.1007587
[10] Creighton, C.; Hanash, S., Mining gene expression databases for association rules, Bioinformatics, 19, 79-86, (2003), <http://arxiv.org/abs/http://bioinformatics.oxfordjournals.org/content/
[11] Dries, A.; De Raedt, L.; Nijssen, S., Mining predictive k-CNF expressions, IEEE Trans. Knowl. Data Eng., 22, 743-748, (2010)
[12] Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci., 95, 14863-14868, (1998), <http://arxiv.org/abs/http://www.pnas.org/content/95/25/14863.full.pdf+
[13] Fang, G.; Haznadar, M.; Wang, W.; Yu, H.; Steinbach, M.; Church, T. R.; Oetting, W. S.; Van Ness, B.; Kumar, V., High-order snp combinations associated with complex diseases: efficient discovery, statistical power and functional interactions, PLoS ONE, 7, e33531, (2012), http://dx.doi.org/10.1371/journal.pone.0033531
[14] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: IJCAI, 1993, pp. 1022-1029.
[15] A. Frank, A. Asuncion, UCI machine learning repository, 2010. <http://archive.ics.uci.edu/ml>.
[16] N. Goel, M. Hsiao, N. Ramakrishnan, M. Zaki, Mining complex boolean expressions for sequential equivalence checking, in: 2010 19th IEEE Asian Test Symposium (ATS), 2010, pp. 442-447. http://dx.doi.org/10.1109/ATS.2010.81.
[17] Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri, M. A.; Bloomfield, C. D.; Lander, E. S., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537, (1999)
[18] Gyenesei, A.; Wagner, U.; Barkow-Oesterreicher, S.; Stolte, E.; Schlapbach, R., Mining co-regulated gene profiles for the detection of functional associations in gene expression data, Bioinformatics, 23, 1927-1935, (2007), <http://arxiv.org/abs/http://bioinformatics.oxfordjournals.org/content/
[19] Haglin, D. J.; Manning, A. M., On minimal infrequent itemset mining, (Stahlbock, R.; Crone, S. F.; Lessmann, S., Proceedings of the 2007 International Conference on Data Mining, (2007), CSREA Press), 141-147
[20] Hamrouni, T.; Ben Yahia, S.; Mephu Nguifo, E., Sweeping the disjunctive search space towards mining new exact concise representations of frequent itemsets, Data Knowl. Eng., 68, 1091-1111, (2009), http://dx.doi.org/10.1016/j.datak.2009.05.001
[21] Han, J.; Pei, J.; Yin, Y., Mining frequent patterns without candidate generation, (SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, (2000), ACM New York, NY, USA), 1-12, http://dx.doi.org/10.1145/342009.335372
[22] Han, J.; Pei, J.; Yin, Y.; Mao, R., Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Min. Knowl. Disc., 8, 53-87, (2004), http://dx.doi.org/10.1023/B:DAMI.0000005258.31418.83
[23] Koh, Y. S.; Rountree, N., Finding sporadic rules using apriori-inverse, (Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, (2005), Springer-Verlag Berlin, Heidelberg), 97-106
[24] Lesnick, T. G.; Papapetropoulos, S.; Mash, D. C.; Ffrench-Mullen, J.; Shehadeh, L.; de Andrade, M.; Henley, J. R.; Rocca, W. A.; Ahlskog, J. E.; Maraganore, D. M., A genomic pathway approach to a complex disease: axon guidance and parkinson disease, PLoS Genet., 3, e98, (2007)
[25] Li, G.; Zaki, M. J., Sampling minimal frequent Boolean (DNF) patterns, (Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2012), ACM New York, NY, USA), 87-95, http://dx.doi.org/10.1145/2339530.2339547
[26] Li, J.; Liu, H.; Downing, J. R.; Yeoh, A. E.J.; Wong, L., Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients, Bioinformatics, 19, 71-78, (2003)
[27] J. Li, H. Liu, L. Wong, Mean-entropy discretized features are effective for classifying high-dimensional biomedical data, in: M.J. Zaki, J.T.L. Wang, H. Toivonen (Eds.), BIOKDD, 2003, pp. 17-24.
[28] Liu, H.; Wang, X.; He, J.; Han, J.; Xin, D.; Shao, Z., Top-down mining of frequent closed patterns from very high dimensional data, Inform. Sci., 179, 899-924, (2009), http://dx.doi.org/10.1016/j.ins.2008.11.033 · Zbl 1162.68561
[29] Lockstone, H. E.; Harris, L. W.; Swatton, J. E.; Wayland, M. T.; Holland, A. J.; Bahn, S., Gene expression profiling in the adult down syndrome brain, Genomics, 90, 647-660, (2007)
[30] Ma, L.; Assimes, T. L.; Asadi, N. B.; Iribarren, C.; Quertermous, T.; Wong, W. H., An “almost exhaustive” search-based sequential permutation method for detecting epistasis in disease association studies, Genet. Epidemiol., 34, 434-443, (2010), http://dx.doi.org/10.1002/gepi.20496
[31] Mannila, H.; Toivonen, H., Multiple uses of frequent sets and condensed representations, (Simoudis, E.; Han, J. W.; Fayyad, U., Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, (1996), AAAI Press), 189-194
[32] Mannila, H.; Toivonen, H., Levelwise search and borders of theories in knowledge discovery, Data Min. Knowl. Disc., 1, 241-258, (1997)
[33] McIntosh, T.; Chawla, S., High confidence rule mining for microarray analysis, IEEE/ACM Trans. Comput. Biol. Bioinform., 4, 611-623, (2007), http://dx.doi.org/10.1109/tcbb.2007.1050
[34] Nanavati, A. A.; Chitrapura, K. P.; Joshi, S.; Krishnapuram, R., Mining generalised disjunctive association rules, (CIKM ’01: Proceedings of the Tenth International Conference on Information and Knowledge Management, (2001), ACM New York, NY, USA), 482-489, http://dx.doi.org/10.1145/502585.502666
[35] S. Naulaerts, P. Meysman, W. Bittremieux, T.N. Vu, W. Vanden Berghe, B. Goethals, K. Laukens, A primer to frequent itemset mining for bioinformatics, Briefings in Bioinformatics, 2013. <http://bib.oxfordjournals.org/content/early/2013/10/26/bib.bbt074.abstr
[36] Nindl, I.; Dang, C.; Forschner, T.; Kuban, R. J.; Meyer, T.; Sterry, W.; Stockfleth, E., Identification of differentially expressed genes in cutaneous squamous cell carcinoma by microarray expression profiling, Mol. Cancer, 5, 30, (2006)
[37] F. Pan, G. Cong, A.K.H. Tung, J. Yang, M. Zaki, CARPENTER: Finding closed patterns in long biological datasets, in: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[38] Pan, F.; Tung, A. K.H.; Cong, G.; Xu, X., COBBLER: combining column and row enumeration for closed pattern discovery, (Proceedings of the 16th International Conference on Scientific and Statistical Database Management, (2004), IEEE Computer Society Los Alamitos, CA, USA), 21, http://dx.doi.org/10.1109/SSDM.2004.1311190
[39] Parida, L.; Ramakrishnan, N., Redescription mining: structure theory and algorithms, (Proceedings of the Twentieth National Conference on Artificial Intelligence, (2005), AAAI Press), 189-194
[40] Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L., Discovering frequent closed itemsets for association rules, (ICDT ’99: Proceedings of the 7th International Conference on Database Theory, (1999), Springer-Verlag London, UK), 398-416
[41] Piatetsky-Shapiro, G.; Tamayo, P., Microarray data mining: facing the challenges, SIGKDD Explor. Newslett., 5, 1-5, (2003), http://dx.doi.org/10.1145/980972.980974
[42] Pomeroy, S. L.; Tamayo, P.; Gaasenbeek, M.; Sturla, L. M.; Angelo, M.; McLaughlin, M. E.; Kim, J. Y.H.; Goumnerova, L. C.; Black, P. M.; Lau, C.; Allen, J. C.; Zagzag, D.; Olson, J. M.; Curran, T.; Wetmore, C.; Biegel, J. A.; Poggio, T.; Mukherjee, S.; Rifkin, R.; Califano, A.; Stolovitzky, G.; Louis, D. N.; Mesirov, J. P.; Lander, E. S.; Golub, T. R., Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature, 415, 436-442, (2002)
[43] Potamias, G.; Koumakis, L.; Kanterakis, A.; Moustakis, V., Towards the discovery of reliable biomarkers from gene-expression profiles: an iterative constraint satisfaction learning approach, (Proceedings of the 6th Hellenic Conference on Artificial Intelligence: Theories, Models and Applications, (2010), Springer-Verlag Berlin, Heidelberg), 233-242, http://dx.doi.org/10.1007/978-3-642-12842-4_27
[44] Quinlan, J. R., C4.5: programs for machine learning, (1993), Morgan Kaufmann Publishers Inc. San Francisco, CA, USA
[45] Ramakrishnan, N.; Zaki, M., Redescription mining and applications in bioinformatics, (Chen, J.; Lonardi, S., Biological Data Mining, CRC Data Mining and Knowledge Discovery Series, (2009), CRC Press, Chapman & Hall), 561-586, (Chapter 22)
[46] Ravetti, M. G.; Moscato, P., Identification of a 5-protein biomarker molecular signature for predicting alzheimer’s disease, PLoS ONE, 3, e3111, (2008)
[47] Richardson, A. L.; Wang, Z. C.; De Nicolo, A.; Lu, X.; Brown, M.; Miron, A.; Liao, X.; Iglehart, J. D.; Livingston, D. M.; Ganesan, S., X chromosomal abnormalities in basal-like human breast cancer, Cancer Cell, 9, 121-132, (2006)
[48] Sahoo, D.; Dill, D. L.; Gentles, A. J.; Tibshirani, R.; Plevritis, S. K., Boolean implication networks derived from large scale, whole genome microarray datasets, Genome Biol., 9, R157+, (2008), http://dx.doi.org/10.1186/gb-2008-9-10-r157
[49] Savasere, A.; Omiecinski, E.; Navathe, S. B., Mining for strong negative associations in a large database of customer transactions, (ICDE ’98: Proceedings of the Fourteenth International Conference on Data Engineering, (1998), IEEE Computer Society Washington, DC, USA), 494-502
[50] C.R. Scherzer, A.C. Eklund, L.J. Morse, Z. Liao, J.J. Locascio, D. Fefer, M.A. Schwarzschild, M.G. Schlossmacher, M.A. Hauser, J.M. Vance, L.R. Sudarsky, D.G. Standaert, J.H. Growdon, R.V. Jensen, S.R. Gullans, Molecular markers of early Parkinson’s disease based on gene expression in blood, in: Proceedings of the National Academy of Sciences 104, 2007, pp. 955-960. http://dx.doi.org/10.1073/pnas.0610204104. <http://arxiv.org/abs/http://www.pnas.org/content/104/3/955.full.pdf+ht
[51] Srikant, R.; Agrawal, R., Mining generalized association rules, (VLDB ’95: Proceedings of the 21th International Conference on Very Large Data Bases, (1995), Morgan Kaufmann Publishers Inc San Francisco, CA, USA), 407-419
[52] Strunnikova, N.; Hilmer, S.; Flippin, J.; Robinson, M.; Hoffman, E.; Csaky, K. G., Differences in gene expression profiles in dermal fibroblasts from control and patients with age-related macular degeneration elicited by oxidative injury, Free Rad. Biol. Med., 39, 781-796, (2005)
[53] Stumme, G.; Taouil, R.; Bastide, Y.; Pasquier, N.; Lakhal, L., Computing iceberg concept lattices with titanic, Data Knowl. Eng., 42, 189-222, (2002), http://dx.doi.org/10.1016/S0169-023X(02)00057-5 · Zbl 0996.68046
[54] L. Szathmary, A. Napoli, P. Valtchev, Towards rare itemset mining, in: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, 2007, pp. 305-312. http://dx.doi.org/10.1109/ICTAI.2007.30.
[55] L. Szathmary, P. Valtchev, A. Napoli, R. Godin, Efficient vertical mining of minimal rare itemsets, in: Proceeding of the 9th International Conference on Concept Lattices and Their Applications, 2012, pp. 269-280.
[56] Thummalapenta, S.; Xie, T., Alattin: mining alternative patterns for defect detection, Autom. Softw. Eng., 18, 293-323, (2011)
[57] Troiano, L.; Scibelli, G., A time-efficient breadth-first level-wise lattice-traversal algorithm to discover rare itemsets, Data Min. Knowl. Disc., 1-35, (2013), http://dx.doi.org/10.1007/s10618-013-0304-3
[58] Tsang, S.; Koh, Y.; Dobbie, G., RP-tree: rare pattern tree mining, (Cuzzocrea, A.; Dayal, U., Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, vol. 6862, (2011), Springer Berlin/Heidelberg), 277-288
[59] Varadan, V.; Anastassiou, D., Inference of disease-related molecular logic from systems-based microarray analysis, PLoS Comput. Biol., 2, (2006), http://dx.doi.org/10.1371/journal.pcbi.0020068
[60] Vimieiro, R.; Moscato, P., Mining disjunctive minimal generators with titanicor, Expert Syst. Appl., 39, 8228-8238, (2012), http://dx.doi.org/10.1016/j.eswa.2012.01.141
[61] Vimieiro, R.; Moscato, P., A new method for mining disjunctive emerging patterns in high-dimensional datasets using hypergraphs, Inform. Syst., 40, 1-10, (2014), <http://www.sciencedirect.com/science/article/pii/S0306437913001221>, http://dx.doi.org/10.1016/j.is.2013.09.001
[62] Weiss, G. M., Mining with rarity: a unifying framework, SIGKDD Explor. Newslett., 6, 7-19, (2004)
[63] Yeoh, E. J.; Ross, M. E.; Shurtleff, S. A.; Williams, W.; Patel, D.; Mahfouz, R.; Behm, F. G.; Raimondi, S. C.; Relling, M. V.; Patel, A.; Cheng, C.; Campana, D.; Wilkins, D.; Zhou, X.; Li, J.; Liu, H.; Pui, C. H.; Evans, W. E.; Naeve, C.; Wong, L.; Downing, J. R., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, 1, 133-143, (2002), <http://www.sciencedirect.com/science/article/pii/S1535610802000326>, http://dx.doi.org/10.1016/S1535-6108(02)00032-6
[64] Zaki, M.; Ramakrishnan, N.; Zhao, L., Mining frequent Boolean expressions: application to gene expression and regulatory modeling, Int. J. Knowl. Disc. Bioinform. (IJKDB), 1, 68-96, (2010)
[65] Zaki, M. J.; Hsiao, C. J., CHARM: an efficient algorithm for closed itemset mining, (Grossman, R. L.; Han, J.; Kumar, V.; Mannila, H.; Motwani, R., Proceedings of the Second SIAM International Conference on Data Mining, (2002), SIAM)
[66] Zaki, M. J.; Ramakrishnan, N., Reasoning about sets using redescription mining, (Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, (2005), ACM New York, NY, USA), 364-373, http://dx.doi.org/10.1145/1081870.1081912
[67] Zhao, L.; Zaki, M. J.; Ramakrishnan, N., BLOSOM: a framework for mining arbitrary Boolean expressions, (Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’06, (2006), ACM New York, NY, USA), 827-832, http://dx.doi.org/10.1145/1150402.1150511
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.