×

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks. (English) Zbl 1428.68318

Summary: Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial to human. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.

MSC:

68T50 Natural language processing
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68R10 Graph theory (including graph drawing) in computer science
68T05 Learning and adaptive systems in artificial intelligence
91D30 Social networks; opinion dynamics
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. In: Mining text data. Springer, pp 163-222
[2] Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Proceedings of the IEEE annual symposium on foundations of computer science (FOCS), pp 475-486
[3] Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Zachary I (2007) A nucleus for a web of open data. Springer, Dbpedia
[4] Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2670-2676
[5] Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 59-68
[6] Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on Freebase from question-answer pairs. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1533-1544
[7] Berg C, Christensen JP, Ressel P (1984) Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100 of graduate texts in mathematics, 1st edn. Springer, Berlin · Zbl 0619.43001 · doi:10.1007/978-1-4612-1128-0
[8] Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res (JMLR) 3:993-1022 · Zbl 1112.68379
[9] Bollacker KD, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1247-1250
[10] Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge · Zbl 1058.90049 · doi:10.1017/CBO9780511804441
[11] Budanitsky A, Hirst G (2006) Evaluating wordnet-based measures of lexical semantic relatedness. Comput Linguist 32(1):13-47 · Zbl 1234.68399 · doi:10.1162/coli.2006.32.1.13
[12] Cai H, Zheng VW, Chang KC (2018) A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Trans Knowl Data Eng (TKDE)
[13] Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the international joint conference on artificial intelligence (IJCAI) workshop on information integration, pp 73-78
[14] Cui P, Wang X, Pei J, Wenwu Z (2017) A survey on network embedding. CoRR. arXiv:1711.08752
[15] Do Q, Roth D, Sammons M, Tu Y, Vydiswaran VGV (2009) Robust, light-weight approaches to compute lexical similarity. In: Computer science research and technical reports. University of Illinois, pp 94-94
[16] Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, Strohmann T, Sun S, Zhang W (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 601-610
[17] Dong Y, Chawla NV, Swami A (2017) metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 135-144
[18] Etzioni O, Cafarella M, Downey D (2004) Webscale information extraction in knowitall (preliminary results). In: Proceedings of the international world wide web conference (WWW), pp 100-110
[19] Fu T-Y, Lee W-C, Lei Z (2017) Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1797-1806
[20] Gabrilovich E, Markovitch S (2005) Feature generation for text categorization using world knowledge. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1048-1053
[21] Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1606-1611
[22] Ganesan P, Garcia-Molina H, Widom J (2003) Exploiting hierarchical domain structure to compute similarity. ACM Trans Inf Syst (TOIS) 21(1):64-93 · doi:10.1145/635484.635487
[23] Goyal P, Ferrara E (2018) Graph embedding techniques, applications, and performance: a survey. Knowl Based Syst 151:78-94 · doi:10.1016/j.knosys.2018.03.022
[24] Han J, Sun Y, Yan X, Yu PS (2010) Mining knowledge from databases: an information network analysis approach. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1251-1252
[25] Hassan S, Mihalcea R, Banea C (2007) Random walk term weighting for improved text classification. In: Proceedings of the international conference on semantic computing (ICSC), pp 242-249
[26] He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Proceedings of the neural information processing systems (NIPS), pp 507-514
[27] Hotho A, Staab S, Stumme G (2003) Ontologies improve text document clustering. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 541-544
[28] Hu J, Fang L, Cao Y, Zeng H-J, Li H, Yang Q, Chen Z (2008) Enhancing text clustering by leveraging Wikipedia semantics. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (SIGIR), pp 179-186
[29] Hu X, Sun N, Zhang C, Chua T-S (2009a) Exploiting internal and external semantics for the clustering of short texts using world knowledge. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 919-928
[30] Hu X, Zhang X, Lu C, Park EK, Zhou X (2009b) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 389-396
[31] Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the New Zealand computer science research student conference (NZCSRSC), pp 49-56
[32] Huang Z, Mamoulis N (2017) Heterogeneous information network embedding for meta path based proximity. CoRR. arXiv:1701.05291
[33] Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Springer, Berlin
[34] Kim J, Rousseau F, Vazirgiannis M (2015) Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 775-780
[35] Kong X, Yu PS, Ding Y, Wild DJ (2012) Meta path-based collective classification in heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1567-1571
[36] Kong X, Zhang J, Yu PS (2013) Inferring anchor links across multiple heterogeneous social networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 179-188
[37] Lakkaraju P, Gauch S, Speretta M (2008) Document similarity based on concept tree distance. In: Proceedings of the nineteenth ACM conference on hypertext and hypermedia, pp 127-132
[38] Lang, Ken, NewsWeeder: Learning to Filter Netnews, 331-339 (1995) · doi:10.1016/B978-1-55860-377-6.50048-7
[39] Lao N, Cohen WW (2010) Relational retrieval using a combination of path-constrained random walks. Mach Learn 81(1):53-67 · Zbl 1470.68130 · doi:10.1007/s10994-010-5205-8
[40] Lao N, Mitchell T, Cohen WW (2011) Random walk inference and learning in a large scale knowledge base. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 529-539
[41] Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res (JMLR) 5:361-397
[42] Li Y, Wang C, Han F, Han J, Roth D, Yan X (2013) Mining evidences for named entity disambiguation. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1070-1078
[43] Lu Q, Getoor L (2003) Link-based classification. In: Proceedings of the international conference on machine learning (ICML), pp 496-503
[44] Luss R, d’Aspremont A (2008) Support vector machine classification with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 953-960
[45] McCallum A, Nigam K, et al (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the association for the advancement of artificial intelligence (AAAI) workshop, vol 752, pp 41-48
[46] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the neural information processing systems (NIPS), pp 3111-3119
[47] Mitchell TM, Cohen WW, Hruschka ER Jr, Talukdar PP, Betteridge J, Carlson A, Mishra BD, Gardner M, Kisiel B, Krishnamurthy J, Lao N, Mazaitis K, Mohamed T, Nakashole N, Platanios EA, Ritter A, Samadi M, Settles B, Wang RC, Wijaya DT, Gupta A, Chen X, Saparov A, Greaves M, Welling J (2015) Never-ending learning. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2302-2310
[48] Mooney, Raymond J., Learning for Semantic Parsing, 311-324 (2007), Berlin, Heidelberg · doi:10.1007/978-3-540-70939-8_28
[49] Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127-152 · Zbl 1079.90102 · doi:10.1007/s10107-004-0552-5
[50] Paul C, Rettinger A, Mogadala A, Knoblock CA, Pedro S (2016) Efficient graph-based document similarity. In: International conference on the semantic web. latest advances and new domains, pp 334-349
[51] Ponzetto SP, Strube M (2007) Deriving a large-scale taxonomy from Wikipedia. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 1440-1445
[52] Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the SIGNLL conference on computational natural language learning (CoNLL), pp 147-155
[53] Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: Annual meeting of the association for computational linguistics (ACL), pp 1702-1712
[54] Sahami M (1998) Using machine learning to improve information access. PhD thesis, Stanford University
[55] Schuhmacher M, Ponzetto SP (2014) Knowledge-based graph document modeling. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 543-552
[56] Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1-47 · doi:10.1145/505282.505283
[57] Shang J, Qu M, Liu J, Kaplan LM, Han J, Peng J (2016) Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. CoRR. arXiv:610.09769
[58] Siolas G, d’Alché-Buc F (2000) Support vector machines based on a semantic kernel for text categorization. In: Proceedings of international joint conference on neural networks (IJCNN), pp 205-209
[59] Song Y, Roth D (2015) Unsupervised sparse vector densification for short text similarity. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL-HLT), pp 1275-1280
[60] Song Y, Pan S, Liu S, Zhou MX, Qian W (2009) Topic and keyword re-ranking for LDA-based topic modeling. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 1757-1760
[61] Song Y, Wang H, Wang Z, Li H, Chen W (2011) Short text conceptualization using a probabilistic knowledgebase. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2330-2336
[62] Song Y, Wang S, Wang H (2015) Open domain short text conceptualization: a generative + descriptive modeling approach. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3820-3826
[63] Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res (JMLR) 3:583-617 · Zbl 1084.68759
[64] Strehl A, Ghosh J, Mooney R (2000) Impact of similarity measures on web-page clustering. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 58-64
[65] Suchanek FM, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the international world wide web conference (WWW), pp 697-706
[66] Sun Y, Han J, Yan X, Yu PS, Tianyi W (2011) Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow (PVLDB) 4(11):992-1003
[67] Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2012) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1348-1356
[68] Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2013) Pathselclus: integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Trans Knowl Discov Data (TKDD) 7(3):11:1-11:23
[69] Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1165-1174
[70] Traverso I, Vidal M-E, Kämpgen B, Sure-Vetter Y (2016) GADES: a graph-based semantic similarity measure. In: International conference on semantic systems (SEMANTiCS), pp 101-104
[71] Tu K, Cui P, Wang X, Wang F, Zhu W (2018) Structural deep embedding for hyper-networks. In: Proceedings of the association for the advancement of artificial intelligence (AAAI)
[72] Wan X, Peng Y (2005) The earth mover’s distance as a semantic measure for document similarity. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 301-302
[73] Wang C, Duan N, Zhou M, Zhang M (2013) Paraphrasing adaptation for web search ranking. In: Annual meeting of the association for computational linguistics (ACL), pp 41-46
[74] Wang C, Song Y, El-Kishky A, Roth D, Zhang M, Han J (2015a) Incorporating world knowledge to document clustering via heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1215-1224
[75] Wang C, Song Y, Li H, Zhang M, Han J (2015b) Knowsim: a document similarity measure on structured heterogeneous information networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 506-513
[76] Wang C, Song Y, Roth D, Wang C, Han J, Ji H, Zhang M (2015c) Constrained information-theoretic tripartite graph clustering to identify semantically similar relations. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3882-3889
[77] Wang C, Song Y, Li H, Zhang M, Han J (2016a) Text classification with heterogeneous information network kernels. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2130-2136
[78] Wang C, Song Y, Roth D, Zhang M, Han J (2016) World knowledge as indirect supervision for document clustering. ACM Trans Knowl Discov Data (TKDD) 11(2):13:1-13:36
[79] Wang C, Song Y, Li H, Sun Y, Zhang M, Han J (2017) Distant meta-path similarities for text-based heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1629-1638
[80] Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 713-721
[81] Wang P, Hu J, Zeng H-J, Chen L, Chen Z (2007) Improving text classification by using encyclopedia knowledge. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 332-341
[82] Wang, Wei; Do, Diep Bich; Lin, Xuemin, Term Graph Model for Text Classification, 19-30 (2005), Berlin, Heidelberg · doi:10.1007/11527503_5
[83] Ying Y, Campbell C, Girolami M (2009) Analysis of SVM with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 2205-2213
[84] Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. In: Saul LK, Weiss Y, Bottou L (eds) Proceedings of the neural information processing systems (NIPS), pp 1601-1608
[85] Zhang J, Kong X, Yu PS (2013) Predicting social links for new users across aligned heterogeneous social networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 1289-1294
[86] Zhang J, Kong X, Yu PS (2014) Transferring heterogeneous links across location-based social networks. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 303-312
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.