
Topic-adjusted visibility metric for scientific articles. (English) Zbl 1358.62114

Summary: Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. Citations are the raw data for constructing impact measures, but these measures are biased if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article-level metric for evaluating the visibility of individual articles. This measure derives from joint probabilistic modeling of the content of the articles and the citations among them, using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations that take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and a case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset of approximately 30,000 high-energy physics papers.
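
As a rough illustration of the case-control likelihood approximation mentioned in the summary, the following minimal Python sketch keeps every observed citation (the "cases"), subsamples the far more numerous non-citations (the "controls"), and reweights the sampled terms. It is a generic sketch under stated assumptions, not the authors' model: the link-probability function `prob`, the toy three-block structure, and all function names are hypothetical.

```python
import math
import random


def full_log_lik(links, non_links, prob):
    """Exact Bernoulli log-likelihood summed over every ordered article pair."""
    ll = sum(math.log(prob(i, j)) for i, j in links)
    ll += sum(math.log1p(-prob(i, j)) for i, j in non_links)
    return ll


def case_control_log_lik(links, non_links, prob, n_controls, rng):
    """Approximate log-likelihood: all links, plus a reweighted sample of non-links."""
    ll = sum(math.log(prob(i, j)) for i, j in links)
    controls = rng.sample(non_links, n_controls)
    # Scale the sampled non-link terms up to the size of the full non-link set.
    weight = len(non_links) / n_controls
    ll += weight * sum(math.log1p(-prob(i, j)) for i, j in controls)
    return ll


def prob(i, j):
    # Hypothetical link probability: articles in the same toy "field"
    # (index modulo 3) cite each other more often than articles across fields.
    return 0.10 if i % 3 == j % 3 else 0.01


# Simulate a small, sparse directed citation network from the toy model.
rng = random.Random(0)
n_articles = 60
pairs = [(i, j) for i in range(n_articles) for j in range(n_articles) if i != j]
links = [p for p in pairs if rng.random() < prob(*p)]
link_set = set(links)
non_links = [p for p in pairs if p not in link_set]

exact = full_log_lik(links, non_links, prob)
approx = case_control_log_lik(links, non_links, prob, 500, random.Random(1))
print(f"exact={exact:.2f} approx={approx:.2f}")
```

Because citation networks are sparse, almost all pairs are non-links, so sampling only those terms removes most of the cost while leaving the estimate of the full sum unbiased.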

MSC:

62P99 Applications of statistics
01A90 Bibliographic studies
62F15 Bayesian inference
91D30 Social networks; opinion dynamics

Software:

lda
