
Clustering sparse binary data with hierarchical Bayesian Bernoulli mixture model. (English) Zbl 1469.62170

Summary: Feature sparsity poses a major technical challenge to existing clustering methods for categorical data. The hierarchical Bayesian Bernoulli mixture model (HBBMM) incorporates constrained empirical Bayes priors for the model parameters, so that the search performed by the resulting Expectation-Maximization (EM) algorithm is confined to a proper region of the parameter space. The EM algorithm yields the maximum a posteriori (MAP) estimate, in which cluster labels are assigned simultaneously. Three criteria are proposed to identify the defining features of individual clusters, leading to an understanding of the underlying data structure. An information-based model selection criterion is applied to determine the number of clusters. Estimation consistency and the performance of the model selection criterion are investigated. Two real-world sparse categorical datasets are analyzed with the proposed method.
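To make the algorithmic part of the summary concrete, the following is a minimal MAP-EM sketch for a Bernoulli mixture on binary data. It is an illustration only: the conjugate Beta/Dirichlet priors, the hyperparameter values, and the function name em_bernoulli_mixture_map are assumptions standing in for the paper's constrained empirical Bayes priors, which are more elaborate.

```python
import numpy as np

def em_bernoulli_mixture_map(X, K, a=2.0, b=2.0, alpha=2.0,
                             n_iter=200, tol=1e-6, seed=0):
    """MAP EM for a K-component Bernoulli mixture on binary data X (n x d).

    Beta(a, b) priors on the Bernoulli parameters and a symmetric
    Dirichlet(alpha) prior on the mixing weights keep the M-step updates
    strictly inside (0, 1), a simple stand-in for the constrained
    empirical Bayes priors used in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = rng.uniform(0.25, 0.75, size=(K, d))  # cluster-wise Bernoulli probs
    pi = np.full(K, 1.0 / K)                      # mixing weights
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for numerical
        # stability on sparse binary data.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(pi))
        log_norm = np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
        r = np.exp(log_lik - log_norm)            # n x K responsibilities
        # M-step: posterior-mode (MAP) updates with prior smoothing.
        nk = r.sum(axis=0)
        theta = (r.T @ X + a - 1.0) / (nk[:, None] + a + b - 2.0)
        pi = (nk + alpha - 1.0) / (n + K * (alpha - 1.0))
        # Observed-data log likelihood (prior terms omitted for brevity)
        # serves as the convergence monitor.
        obj = log_norm.sum()
        if obj - prev < tol:
            break
        prev = obj
    labels = r.argmax(axis=1)  # cluster labels assigned simultaneously
    return labels, theta, pi
```

At convergence the posterior responsibilities deliver both the MAP parameter estimates and the cluster labels in one pass, mirroring the simultaneous label assignment described in the summary.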

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
