×

Block clustering with collapsed latent block models. (English) Zbl 1322.62046

Summary: We introduce a Bayesian extension of the latent block model for model-based block clustering of data matrices. Our approach considers a block model where block parameters may be integrated out. The result is a posterior defined over the number of clusters in rows and columns and cluster memberships. The number of row and column clusters need not be known in advance as these are sampled along with cluster memberhips using Markov chain Monte Carlo. This differs from existing work on latent block models, where the number of clusters is assumed known or is chosen using some information criteria. We analyze both simulated and real data to validate the technique.

MSC:

62-07 Data analysis (statistics) (MSC2010)
62H30 Classification and discrimination; cluster analysis (statistical aspects)
PDF BibTeX XML Cite
Full Text: DOI arXiv

References:

[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) · Zbl 1112.68379
[2] Bozdogan, H.: Mixture-model cluster analysis using model selection criteria and a new information measure of complexity. In: Bozdogan, H. (ed.) Proceedings of the first US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, vol. 2, pp. 69–113. Kluwer Academic, Boston (1994)
[3] Brooks, S.P., Giudici, P., Roberts, G.O.: Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions (with discussion). J. R. Stat. Soc., Ser. B, Stat. Methodol. 65, 3–39 (2003) · Zbl 1063.62120
[4] Carpaneto, G., Martello, S., Toth, P.: Algorithms and codes for the assignment problem. Ann. Oper. Res. 13, 193–223 (1988)
[5] Carpaneto, G., Toth, P.: Algorithm 548: Solution of the assignment problem. ACM Trans. Math. Softw. 6, 104–111 (1980) · Zbl 0445.90089
[6] Celeux, G., Hurn, M., Robert, C.P.: Computational and inferential difficulties with mixtures posterior distribution. J. Am. Stat. Assoc. 95, 957–979 (2000) · Zbl 0999.62020
[7] Cheng, Y., Church, G.M.: Biclustering of expression data. In: ISMB 2000 Proceedings, pp. 93–103 (2000)
[8] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc., Ser. B, Stat. Methodol. 39, 1–38 (1977) · Zbl 0364.62022
[9] Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002) · Zbl 1073.62545
[10] Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97, 12079–12084 (2000)
[11] Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: Comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008) · Zbl 1452.62444
[12] Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995) · Zbl 0861.62023
[13] Green, P.J.: Trans-Dimensional Markov chain Monte Carlo. In: Green, P.J., Hjord, N.L., Richardson, S. (eds.) Highly Structured Stochastic Systems, pp. 179–198. Oxford University Press, Oxford (2003)
[14] Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. USA 101, 5228–5235 (2004)
[15] Hartigan, J.A.: Direct clustering of a data matrix. J. Am. Stat. Assoc. 67, 123–129 (1972)
[16] Hartigan, J.A.: Bloc voting in the United States senate. J. Classif. 17, 29–49 (2000) · Zbl 1103.91335
[17] Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42, 177–196 (2001) · Zbl 0970.68130
[18] Kaiser, S., Santamaria, R., Sill, M., Theron, R., Quintales, L., Leisch, F.: Biclust: BiCluster Algorithms. R package version 0.9.1. (2009)
[19] Kluger, Y., Basri, R., Chang, J.T., Gerstein, M.: Spectral Biclustering of microarray data: Coclustering genes and conditions. Genome Res. 13, 703–716 (2003)
[20] Lazzeroni, L., Owen, A.: Plaid models for gene expression data. Stat. Sin. 12, 61–86 (2002) · Zbl 1004.62084
[21] Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2004)
[22] Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–358. Kluwer Academic, Dordrecht (1998) · Zbl 0916.62019
[23] Nobile, A.: Bayesian finite mixtures: a note on prior specification and posterior computation. Technical report. Department of Statistics, University of Glasgow (2005)
[24] Nobile, A., Fearnside, A.T.: Bayesian finite mixtures with an unknown number of components: The allocation sampler. Stat. Comput. 17, 147–162 (2007)
[25] Phillips, D.B., Smith, A.F.M.: Bayesian model comparison via jump diffusions. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 215–239. Chapman & Hall, London (1996) · Zbl 0855.62018
[26] Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. B 59, 731–792 (1997) · Zbl 0891.62020
[27] Robert, C.P., Rydén, T., Titterington, D.M.: Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. J. R. Stat. Soc. B 62, 57–76 (2000) · Zbl 0941.62090
[28] Roberts, G.O.: Markov chain concepts related to sampling algorithms. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 45–58. Chapman & Hall, London (1996) · Zbl 0839.62078
[29] Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978) · Zbl 0379.62005
[30] Sheng, Q., Moreau, Y., Moor, B.D.: Biclustering microarray data by Gibbs sampling. Bioinformatics 19, 196–205 (2003)
[31] Spiegelhalter, D.J., Best, N.G., Gilks, W.R., Inskip, H.: Hepatitis B: a case study in MCMC methods. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 21–44. Chapman & Hall, London (1996) · Zbl 0840.92012
[32] Stephens, M.: Bayesian analysis of mixture models with an unknown number of components- an alternative to reversible jump methods. Ann. Stat. 28, 40–74 (2000) · Zbl 1106.62316
[33] Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P.: Clustering methods for the analysis of DNA microarray data. Technical report, Stanford University (1999)
[34] van Dijk, B., van Rosmalen, J., Paap, R.: A Bayesian approach to two-mode clustering. Technical report, Econometric Institute Report, Erasmus University Rotterdam (2009)
[35] Wit, E., McClure, J.: Statistics for Microarrays: Design, Analysis and Inference. Wiley, New York (2004) · Zbl 1049.62120
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.