Effective implementations of topic modeling algorithms. (English) Zbl 1485.68209

Summary: In this paper, we provide an overview of effective EM-like learning algorithms for latent Dirichlet allocation (LDA) models and additively regularized topic models (ARTMs). First, we review 11 techniques for efficient topic modeling based on synchronous and asynchronous parallel computing, distributed data storage, streaming, batch processing, RAM optimization, and improved fault tolerance. Second, we review 14 effective implementations of topic modeling algorithms proposed in the literature over the past 10 years that combine these techniques in various ways, and we carry out a comparative analysis of them.
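
To make the shared algorithmic core concrete: the surveyed systems iterate an EM-like update in which the E-step computes topic responsibilities p(t|d,w) ∝ phi_wt · theta_td and the M-step renormalizes the accumulated counters into the matrices Phi = (phi_wt) and Theta = (theta_td). The sketch below is a minimal NumPy illustration of one such iteration for plain PLSA, not code from any of the surveyed implementations; the function and variable names are invented for this example, the regularizer is set to zero (an ARTM regularizer would add phi·dR/dphi and theta·dR/dtheta to the counters before normalization), and the whole document-term matrix is held dense in RAM, which is precisely what the batching and streaming techniques reviewed in the paper avoid.

    import numpy as np

    def em_iteration(n_dw, phi, theta, eps=1e-12):
        # One EM step (illustrative sketch). n_dw: (D, W) word counts;
        # phi: (W, T) = p(w|t); theta: (T, D) = p(t|d).
        D, W = n_dw.shape
        T = phi.shape[1]
        n_wt = np.zeros((W, T))   # expected word-topic counts
        n_td = np.zeros((T, D))   # expected topic-document counts
        for d in range(D):
            # E-step for document d: p(t|d,w) ~ phi[w, t] * theta[t, d]
            p_tdw = phi * theta[:, d]
            p_tdw /= p_tdw.sum(axis=1, keepdims=True) + eps
            # M-step accumulation, weighted by the word counts of document d
            n_wt += n_dw[d][:, None] * p_tdw
            n_td[:, d] = (n_dw[d][:, None] * p_tdw).sum(axis=0)
        # Normalize counters into stochastic matrices; ARTM regularizer
        # terms would be added to n_wt and n_td here before normalizing.
        phi = n_wt / (n_wt.sum(axis=0, keepdims=True) + eps)
        theta = n_td / (n_td.sum(axis=0, keepdims=True) + eps)
        return phi, theta

    # Toy usage: random initialization, then a fixed number of iterations.
    rng = np.random.default_rng(0)
    n_dw = rng.integers(0, 5, size=(100, 1000)).astype(float)
    phi = rng.random((1000, 20)); phi /= phi.sum(axis=0)
    theta = rng.random((20, 100)); theta /= theta.sum(axis=0)
    for _ in range(30):
        phi, theta = em_iteration(n_dw, phi, theta)

The per-document E-steps are mutually independent, which is what the synchronous and asynchronous parallel schemes exploit; replacing the loop over all documents with a pass over one batch at a time yields the online and batch-processing variants discussed in the paper.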

MSC:

68T05 Learning and adaptive systems in artificial intelligence

References:

[1] Hofmann, T., Probabilistic latent semantic indexing, Proc. 22nd Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 1999, pp. 50-57.
[2] Blei, D.; Ng, A.; Jordan, M., Latent Dirichlet allocation, J. Mach. Learn. Res., 3, 993-1022 (2003) · Zbl 1112.68379
[3] Kochedykov, D., Apishev, M., Golitsyn, L., and Vorontsov, K., Fast and modular regularized topic modeling, Proc. 21st Conf. Open Innovations Association (FRUCT), 2017, pp. 182-193.
[4] Vorontsov, K. V., Additive regularization for topic models of text collections, Dokl. Math., 89, 301-304 (2014) · Zbl 1358.68242
[5] Vorontsov, K. V.; Potapenko, A. A., Additive regularization of topic models, Mach. Learn., 101, 303-323 (2015) · Zbl 1383.62069
[6] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., Ser. B, 39, 1-38 (1977) · Zbl 0364.62022
[7] Teh, Y.W., Newman, D., and Welling, M., A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, Proc. 19th Int. Conf. Neural Information Processing, 2006, pp. 1353-1360.
[8] Steyvers, M.; Griffiths, T., Finding scientific topics, Proc. Natl. Acad. Sci. U. S. A., 101, 5228-5235 (2004)
[9] Vorontsov, K. V.; Potapenko, A. A., EM-like algorithms for probabilistic topic modeling, Mach. Learn. Data Anal., 1, 657-686 (2013)
[10] Asuncion, A., Welling, M., Smyth, P., and Teh, Y.W., On smoothing and inference for topic models, Proc. Int. Conf. Uncertainty in Artificial Intelligence, 2009, pp. 27-34.
[11] Newman, D.; Asuncion, A.; Smyth, P.; Welling, M., Distributed algorithms for topic models, J. Mach. Learn. Res., 10, 1801-1828 (2009) · Zbl 1235.68324
[12] Wang, Y.; Bai, H.; Stanton, M.; Chen, W.; Chang, E., PLDA: Parallel latent Dirichlet allocation for large-scale applications, Lect. Notes Comput. Sci., 5564, 301-314 (2009)
[13] Asuncion, A.; Smyth, P.; Welling, M., Asynchronous distributed estimation of topic models for document analysis, Stat. Methodol., 8, 3-17 (2010)
[14] Liu, Z., Zhang, Y., Chang, E., and Sun, M., PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing, ACM Trans. Intell. Syst. Technol., 2011, vol. 2, no. 3.
[15] Smola, A. and Narayanamurthy, S., An architecture for parallel topic models, Proc. VLDB Endowment, 2010, vol. 3, nos. 1-2, pp. 703-710.
[16] Řehůřek, R. and Sojka, P., Software framework for topic modeling with large corpora, Proc. LREC 2010 Workshop New Challenges for NLP Frameworks, 2010, pp. 45-50.
[17] Zhai, K., Boyd-Graber, J., Asadi, N., and Alkhouja, M., Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce, Proc. 21st Int. Conf. World Wide Web, 2012, pp. 879-888.
[18] Qiu, Z.; Wu, B.; Wang, B.; Yu, L., Collapsed Gibbs sampling for latent Dirichlet allocation on Spark, Proc. Mach. Learn. Res., 36, 17-28 (2014)
[19] Wang, Y., Zhao, X., Sun, Z., Yan, H., Wang, L., Jin, Z., Wang, L., Gao, Y., Law, C., and Zeng, J., Peacock: Learning long-tail topic features for industrial applications, ACM Trans. Intell. Syst. Technol., 2015, vol. 6, no. 4.
[20] Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E., Liu, T., and Ma, W., LightLDA: Big topic models on modest computer clusters, Proc. 24th Int. Conf. World Wide Web, 2015, pp. 1351-1361.
[21] Zhao, B.; Zhou, H.; Li, G.; Huang, Y., ZenLDA: An efficient and scalable topic model training system on distributed data-parallel platform, Big Data Min. Anal., 1, 57-74 (2018)
[22] Vorontsov, K., Frei, O., Apishev, M., Romov, P., Suvorova, M., and Yanina, A., Non-Bayesian additive regularization for multimodal topic modeling of large collections, Proc. Workshop Topic Models: Post-Processing and Applications, 2015, pp. 29-37.
[23] Frei, O.; Apishev, M., Parallel non-blocking deterministic algorithm for online topic modeling, Commun. Comput. Inf. Sci., 661, 132-144 (2016)
[24] Zeng, J.; Liu, Z.; Cao, X., Fast online EM for big topic modeling, IEEE Trans. Knowl. Data Eng., 28, 675-688 (2016)
[25] Hoffman, M., Blei, D., and Bach, F., Online learning for latent Dirichlet allocation, Proc. 23rd Int. Conf. Neural Information Processing Systems, 2010, vol. 1, pp. 856-864.
[26] Vowpal Wabbit ML library. https://github.com/JohnLangford/vowpal_wabbit. Accessed January 15, 2020.
[27] White, T., Hadoop: The Definitive Guide, O'Reilly Media (2012)
[28] Zaharia, M., Chowdhury, M., and Franklin, M., Spark: Cluster computing with working sets, Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing, 2010.
[29] MPI: A message-passing interface standard, Version 3.0, 2012. https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. Accessed January 15, 2020.
[30] Dean, J. and Ghemawat, S., MapReduce: Simplified data processing on large clusters, Proc. 6th Symp. Operating Systems Design and Implementation, 2004, pp. 137-149.
[31] Petuum ML platform. http://www.petuum.com. Accessed January 15, 2020.
[32] Spark GraphX library. https://spark.apache.org/graphx. Accessed January 15, 2020.