Yang, Yang; Deng, Ke
Generalized theme dictionary models for association pattern discovery. (English) Zbl 07656976
Ann. Appl. Stat. 17, No. 1, 269-293 (2023).

Summary: Discovering association patterns of items from a collection of baskets composed of different items is an important problem in various fields. Assuming that each basket is composed of themes of items randomly sampled from a theme dictionary, the theme dictionary model provides a general framework to achieve efficient association pattern discovery with statistical inference. This paper extends the original theme dictionary model by allowing more than one category of items in a basket and only presence/absence of items is observed for each basket with all quantitative information missing. The extended models can solve a larger range of practical problems that cannot be handled by the original theme dictionary model. Both simulation studies and real data applications confirm the superiority of the proposed methods over the existing ones.

MSC: 62Pxx Applications of statistics

Keywords: association rule mining; cross-category association patterns; missing data problem; Monte Carlo expectation-maximization algorithm; theme dictionary model

Software: FP-growth; Apriori; Eclat