Bibliographic analysis on research publications using authors, categorical labels and the citation network. (English) Zbl 1383.62364

Summary: Bibliographic analysis considers the author’s research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeer\(^{\mathrm{X}}\). The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task.

MSC:

62P99 Applications of statistics
68P20 Information storage and retrieval of data
62G05 Nonparametric estimation
62F15 Bayesian inference

References:

[1] Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. JMLR, 3, 993-1022. · Zbl 1112.68379
[2] Buntine, W., & Hutter, M. (2012). A Bayesian view of the Poisson-Dirichlet process. ArXiv e-prints 1007.0296v2.
[3] Buntine, W., & Mishra, S. (2014). Experiments with non-parametric topic models. In KDD (pp. 881-890). ACM.
[4] Carpenter, B. (2004). Phrasal queries with LingPipe and Lucene: Ad hoc genomics text retrieval. In TREC.
[5] Casella, G., & Robert, C. P. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, 83(1), 81-94. · Zbl 0866.62024 · doi:10.1093/biomet/83.1.81
[6] Chang, J., & Blei, D. (2010). Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1), 124-150. · Zbl 1189.62191 · doi:10.1214/09-AOAS309
[7] Chen, C., Du, L., & Buntine, W. (2011). Sampling table configurations for the hierarchical Poisson-Dirichlet process. In ECML (pp. 296-311). Springer.
[8] Goldwater, S., Griffiths, T., & Johnson, M. (2011). Producing power-law distributions and damping word frequencies with two-stage language models. JMLR, 12, 2335-2382. · Zbl 1280.62037
[9] Han, H., Giles, C. L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In JCDL (pp. 296-305). ACM.
[10] Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In JCDL (pp. 334-343). ACM.
[11] Kataria, S., Mitra, P., Caragea, C., & Giles, C. L. (2011). Context sensitive topic models for author influence in document networks. In IJCAI (pp. 2274-2280). AAAI Press.
[12] Lim, K. W., & Buntine, W. (2014). Bibliographic analysis with the citation network topic model. In ACML (pp. 142-158).
[13] Lim, K. W., Chen, C., & Buntine, W. (2013). Twitter-network topic model: A full Bayesian treatment for social network and text modeling. In NIPS Topic Model workshop.
[14] Liu, L., Tang, J., Han, J., Jiang, M., & Yang, S. (2010). Mining topic-level influence in heterogeneous networks. In CIKM (pp. 199-208). ACM.
[15] Liu, Y., Niculescu-Mizil, A., & Gryc, W. (2009). Topic-link LDA: Joint models of topic and author community. In ICML (pp. 665-672). ACM.
[16] Lui, M., & Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In ACL (pp. 25-30). ACL.
[17] Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. · Zbl 1160.68008 · doi:10.1017/CBO9780511809071
[18] McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.
[19] Mimno, D., & McCallum, A. (2007). Mining a digital library for influential authors. In JCDL (pp. 105-106). ACM.
[20] Nallapati, R., Ahmed, A., Xing, E., & Cohen, W. (2008). Joint latent topic models for text and citations. In KDD (pp. 542-550). ACM.
[21] Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1), 27-29.
[22] Pitman, J. (1996). Some developments of the Blackwell-Macqueen urn scheme. Lecture Notes—Monograph Series (pp. 245-267).
[23] Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In UAI (pp. 487-494). AUAI Press.
[24] Sato, I., & Nakagawa, H. (2010). Topic models with power-law using Pitman-Yor process. In KDD (pp. 673-682). ACM.
[25] Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., & Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3), 93-106.
[26] Tang, J., Sun, J., Wang, C., & Yang, Z. (2009). Social influence analysis in large-scale networks. In KDD (pp. 807-816). ACM.
[27] Teh, Y. W. (2006a). A Bayesian interpretation of interpolated Kneser-Ney. Tech. rep., School of Computing, National University of Singapore.
[28] Teh, Y. W. (2006b). A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL (pp. 985-992). ACL.
[29] Teh, Y. W., & Jordan, M. (2010). Hierarchical Bayesian nonparametric models with applications. In N. L. Hjort, C. Holmes, P. Müller, & S. G. Walker (Eds.), Bayesian nonparametrics: Principles and practice (Chap. 5). Cambridge University Press.
[30] Tu, Y., Johri, N., Roth, D., & Hockenmaier, J. (2010). Citation author topic model in expert search. In COLING (pp. 1265-1273). ACL.
[31] Wallach, H., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In NIPS (pp. 1973-1981).
[32] Weng, J., Lim, E. P., Jiang, J., & He, Q. (2010). TwitterRank: Finding topic-sensitive influential Twitterers. In WSDM (pp. 261-270). ACM.
[33] Zhu, Y., Yan, X., Getoor, L., & Moore, C. (2013). Scalable text and link analysis with mixed-topic link models. In KDD (pp. 473-481). ACM.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.