NetSDM: semantic data mining with network analysis. (English) Zbl 1483.68350

Summary: Semantic data mining (SDM) is a form of relational data mining that uses annotated data together with complex semantic background knowledge to learn rules that can be easily interpreted. The drawback of SDM is the high computational complexity of existing SDM algorithms, which results in long run times even on relatively small data sets. This paper proposes an effective SDM approach, named NetSDM, which first transforms the available semantic background knowledge into a network format and then applies network-analysis-based node ranking and pruning to significantly reduce the size of the original background knowledge. The experimental evaluation of the NetSDM methodology on acute lymphoblastic leukemia and breast cancer data demonstrates that NetSDM achieves radical time efficiency improvements and that the learned rules are comparable to or better than the rules obtained by the original SDM algorithms.
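
The rank-then-prune idea in the summary can be illustrated with a minimal sketch. It assumes personalized PageRank as the node-scoring function and treats the background knowledge as an undirected graph; the summary itself does not fix either choice. The node names, the toy graph, and the helper functions `personalized_pagerank` and `prune_terms` are hypothetical and are not taken from the NetSDM implementation.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Power-iteration personalized PageRank on an undirected graph.

    Assumption: every seed node appears in at least one edge.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = sorted(neighbours)
    # The random walk restarts uniformly at the seed nodes (target examples).
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for u in nodes:
            # Each node spreads its damped score evenly over its neighbours.
            share = damping * rank[u] / len(neighbours[u])
            for v in neighbours[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


def prune_terms(rank, terms, keep_fraction):
    """Keep the highest-ranked fraction of background-knowledge terms."""
    ordered = sorted(terms, key=lambda t: rank[t], reverse=True)
    keep = max(1, int(len(ordered) * keep_fraction))
    return set(ordered[:keep])


# Hypothetical toy network: examples e1..e3 annotated with terms t_a, t_b,
# and a small term hierarchy below t_root.
edges = [
    ("e1", "t_a"), ("e2", "t_a"), ("e3", "t_b"),
    ("t_a", "t_root"), ("t_b", "t_root"), ("t_c", "t_root"),
]
terms = {"t_a", "t_b", "t_c", "t_root"}
rank = personalized_pagerank(edges, seeds={"e1", "e2"})
kept = prune_terms(rank, terms, keep_fraction=0.5)
```

In this sketch the terms closest to the seed examples receive the highest scores, so a rule learner run afterwards would only need to search over the retained terms in `kept`.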


68T09 Computational aspects of data analysis and big data
68R10 Graph theory (including graph drawing) in computer science
68T05 Learning and adaptive systems in artificial intelligence
68T30 Knowledge representation


[1] Prem Raj Adhikari, Anže Vavpetič, Jan Kralj, Nada Lavrač, and Jaakko Hollmén. Explaining mixture models through semantic pattern mining and banded matrix visualization. Machine Learning, 105(1):3-39, 2016.
[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487-499, San Francisco, California, USA, 1994.
[3] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, et al. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.
[4] Alex Bavelas. Communication patterns in task-oriented groups. Journal of the Acoustical Society of America, 22:725-730, 1950.
[5] Ronald S. Burt and Michael J. Minor. Applied Network Analysis: A Methodological Introduction. Sage Publications, 1983.
[6] Alison Callahan, Jose Cruz-Toledo, Peter Ansell, and Michel Dumontier. Bio2RDF Release 2: Improved coverage, interoperability and provenance of life science linked data. In ESWC, volume 7882 of Lecture Notes in Computer Science, pages 200-212. Springer, 2013.
[7] Alison Callahan, Juan José Cifuentes, and Michel Dumontier. An evidence-based approach to identify aging-related genes in Caenorhabditis elegans. BMC Bioinformatics, 16(1):1, 2015.
[8] Sabina Chiaretti, Xiaochun Li, Robert Gentleman, Antonella Vitale, Marco Vignetti, Franco Mandelli, Jerome Ritz, and Robin Foa. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771-2778, 2004.
[9] Lynne S. Cox and Richard Faragher. From old organisms to new molecules: Integrative biology and therapeutic targets in accelerated human ageing. Cellular and Molecular Life Sciences, 64(19-20):2620-2641, 2007.
[10] F. Crestani. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 11(6):453-482, December 1997.
[11] L. De Raedt. Logical and Relational Learning. Springer, 2008. · Zbl 1203.68145
[12] Dejing Dou, Hao Wang, and Haishan Liu. Semantic data mining: A survey of ontology-based approaches. In Semantic Computing (ICSC), 2015 IEEE International Conference on, pages 244-251. IEEE, 2015.
[13] Sašo Džeroski and Nada Lavrač, editors. Relational Data Mining. Springer, 2001.
[14] Lauri Eronen and Hannu Toivonen. BioMine: Predicting links between biological entities using network models of heterogeneous databases. BMC Bioinformatics, 13:119, 2012.
[15] Linton C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40:35-41, 1977.
[16] Linton C. Freeman. Centrality in social networks conceptual clarification. Social Networks, 1(3):215-239, 1979.
[17] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of Rule Learning. Springer, 2012.
[18] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 2016.
[19] Miha Grčar, Nejc Trdin, and Nada Lavrač. A methodology for mining document-enriched heterogeneous information networks. The Computer Journal, 56(3):321-335, 2013.
[20] Nicola Guarino, Daniel Oberle, and Steffen Staab. What Is an Ontology? In Handbook on Ontologies, pages 1-17. Springer, 2009.
[21] Robert Hoehndorf, Michel Dumontier, and Georgios V. Gkoutos. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics, 28(16):2169-2175, 2012.
[22] Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1):44-57, 2008.
[23] Wilhelmiina Hämäläinen. Efficient search for statistically significant dependency rules in binary data. PhD thesis, Department of Computer Science, University of Helsinki, Finland, 2010.
[24] Glen Jeh and Jennifer Widom. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538-543. ACM, 2002.
[25] Mikhail Jiline, Stan Matwin, and Marcel Turcotte. Annotation concept synthesis and enrichment analysis: A logic-based approach to the interpretation of high-throughput experiments. Bioinformatics, 27(17):2391-2398, 2011.
[26] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39-43, 1953. · Zbl 0053.27606
[27] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. · Zbl 1065.68660
[28] Willi Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249-271. American Association for Artificial Intelligence, 1996.
[29] Risi Imre Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the 19th International Conference on Machine Learning, pages 315-322, 2002.
[30] Nada Lavrač and Anže Vavpetič. Relational and semantic data mining. In Proceedings of the Thirteenth International Conference on Logic Programming and Nonmonotonic Reasoning, pages 20-31, Lexington, KY, USA, 2015.
[31] Nada Lavrač, Branko Kavšek, Peter Flach, and Ljupčo Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153-188, 2004.
[32] Agnieszka Lawrynowicz and Jedrzej Potoniec. Fr-ONT: An algorithm for frequent concept mining with formal ontologies. In Foundations of Intelligent Systems, Proceedings of 19th International Symposium on Methodologies for Intelligent Systems (2011), volume 6804 of Lecture Notes in Computer Science, pages 428-437, 2011.
[33] Paea LePendu, Srinivasan V. Iyer, Anna Bauer-Mehren, Rave Harpaz, Jonathan M. Mortensen, Tanya Podchiyska, Todd A. Ferris, and Nigam H Shah. Pharmacovigilance using clinical notes. Clinical Pharmacology & Therapeutics, 93(6):547-555, 2013.
[34] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), pages 80-86, 1998.
[35] Haishan Liu, Dejing Dou, Ruoming Jin, Paea LePendu, and Nigam Shah. Mining biomedical ontologies and data using RDF hypergraphs. In Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), 2013, volume 1, pages 141-146. IEEE, 2013.
[36] Svetlana Lyalina, Bethany Percha, Paea LePendu, Srinivasan V. Iyer, Russ B. Altman, and Nigam H. Shah. Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records. Journal of the American Medical Informatics Association, 20(e2):e297-e305, 2013.
[37] Donna Maglott, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Research, 33(Database issue):D54-D58, 2005.
[38] Stephen Muggleton. Inverse entailment and Progol. New Generation Computing, 13(3-4):245-286, 1995.
[39] Athanasios N Nikolakopoulos and John D Garofalakis. NCDawareRank: A novel ranking method that exploits the decomposable structure of the web. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 143-152. ACM, 2013.
[40] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru Fujibuchi, Hidemasa Bono, and Minoru Kanehisa. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1):29-34, 1999.
[41] David Page, Vítor Santos Costa, Sriraam Natarajan, Aubrey Barnard, Peggy Peissig, and Michael Caldwell. Identifying adverse drug events by relational learning. In Proceedings of the Twenty-sixth AAAI Conference on Artificial Intelligence, volume 2012, page 790, Toronto, Canada, 2012.
[42] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, November 1999.
[43] Peggy L Peissig, Vitor Santos Costa, Michael D Caldwell, Carla Rottscheit, Richard L Berg, Eneida A Mendonca, and David Page. Relational machine learning for electronic health record-driven phenotyping. Journal of Biomedical Informatics, 52:260-270, 2014.
[44] Gregory Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, pages 229-248. Menlo Park, CA: AAAI/MIT, 1991.
[45] Vid Podpečan, Nada Lavrač, Igor Mozetič, Petra Kralj Novak, Igor Trajkovski, Laura Langohr, Kimmo Kulovesi, Hannu Toivonen, Marko Petek, Helena Motaln, et al. SegMine workflows for semantic microarray data analysis in Orange4WS. BMC Bioinformatics, 12(1):416, 2011.
[46] Monika Puzianowska-Kuznicka and Jacek Kuznicki. Genetic alterations in accelerated ageing syndromes: Do they play a role in natural ageing? The International Journal of Biochemistry & Cell Biology, 37(5):947-960, 2005.
[47] Steffen Rendle. Scaling factorization machines to relational data. Proceedings of the VLDB Endowment, 6(5):337-348, March 2013. ISSN 2150-8097.
[48] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, et al. Gene expression profiling in breast cancer: Understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute, 98(4):262-272, 2006.
[49] Ashwin Srinivasan. The Aleph Manual, 1999. Available at http://www.cs.ox.ac.uk/activities/machinelearning/Aleph/aleph.
[50] Yizhou Sun and Jiawei Han. Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool Publishers, 2012.
[51] Hannah Tipney and Lawrence Hunter. An introduction to effective use of enrichment analysis software. Human Genomics, 4(3):1, 2010.
[52] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, pages 613-622, Washington, DC, USA, 2006.
[53] Igor Trajkovski, Nada Lavrač, and Jakub Tolar. SEGS: Search for enriched gene sets in microarray data. Journal of Biomedical Informatics, 41(4):588-601, 2008a.
[54] Igor Trajkovski, Filip Železný, Nada Lavrač, and Jakub Tolar. Learning relational descriptions of differentially expressed gene groups. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 38(1):16-25, 2008b.
[55] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology, 6(1), 2010.
[56] Anže Vavpetič, Vid Podpečan, and Nada Lavrač. Semantic subgroup explanations. Journal of Intelligent Information Systems, 42(2):233-254, 2014.
[57] Anže Vavpetič and Nada Lavrač. Semantic subgroup discovery systems and workflows in the SDM-toolkit. The Computer Journal, 56(3):304-320, 2013.
[58] Anže Vavpetič, Petra Kralj Novak, Miha Grčar, Igor Mozetič, and Nada Lavrač. Semantic data mining of financial news articles. In Proceedings of Sixteenth International Conference on Discovery Science (DS 2013), volume 8140 of Lecture Notes in Computer Science, pages 294-307, Singapore, 2013.
[59] Monika Žáková, Filip Železný, Javier A. Sedano, Cyril Masia Tissot, Nada Lavrač, Petr Kremen, and Javier Molina. Relational data mining applied to virtual engineering of product designs. In Proceedings of the 16th International Conference on Inductive Logic Programming (ILP’06), pages 439-453, Santiago de Compostela, Spain, 2006.
[60] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005. · Zbl 1076.68555
[61] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Conference on Principles of Data Mining and Knowledge Discovery (PKDD ’97), pages 78-87. Springer, 1997.
[62] Wenpu Xing and Ali Ghorbani. Weighted PageRank algorithm. In Proceedings of the 2nd Annual Conference on Communication Networks and Services Research, pages 305-314. IEEE, 2004.
[63] Liang Zhang, Bingpeng Ma, Guorong Li, Qingming Huang, and Qi Tian.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.