Goodness-of-fit test for latent block models. (English) Zbl 07345038

Summary: Latent block models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or it consists of more clusters in at least one direction of the row and the column. To construct the test method, we used a result from the random matrix theory for a sample covariance matrix. We experimentally demonstrated the effectiveness of the proposed method by showing the asymptotic behavior of the test statistic and measuring the test accuracy.


62-XX Statistics


Full Text: DOI arXiv


[1] Ames, B. P.W., Guaranteed clustering and biclustering via semidefinite programming, Math. Program., 147, 1, 429-465 (2014) · Zbl 1297.90107
[2] Arabie, P.; Boorman, S. A.; Levitt, P. R., Constructing blockmodels: How and why, J. Math. Psych., 17, 1, 21-63 (1978) · Zbl 0375.92001
[3] Bai, Z. D.; Yin, Y. Q., Limit of the smallest eigenvalue of a large dimensional sample covariance matrix, Ann. Probab., 21, 3, 1275-1294 (1993) · Zbl 0779.60026
[4] Bao, Z.; Pan, G.; Zhou, W., Universality for the largest eigenvalue of sample covariance matrices with general population, Ann. Statist., 43, 1, 382-421 (2015) · Zbl 1408.60006
[5] Bickel, P. J.; Sarkar, P., Hypothesis testing for automated community detection in networks, J. R. Stat. Soc. Ser. B Stat. Methodol., 78, 1, 253-273 (2016) · Zbl 1411.62162
[6] Bloemendal, A.; Knowles, A.; Yau, H.-T.; Yin, J., On the principal components of sample covariance matrices, Probab. Theory Related Fields, 164, 459-552 (2016) · Zbl 1339.15023
[7] Brault, V.; Channarond, A., Fast and consistent algorithm for the latent block model (2016), arXiv:1610.09005
[8] Chen, K.; Lei, J., Network cross-validation for determining the number of communities in network data, J. Amer. Statist. Assoc., 113, 521, 241-251 (2018) · Zbl 1398.62159
[9] Conover, W. J., Practical Nonparametric Statistics (1999), John Wiley & Sons, New York
[10] Corneli, M., Latouche, P., Rossi, F., 2015. Exact ICL maximization in a non-stationary time extension of the latent block model for dynamic networks. In: Proceedings of the 23-th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. pp. 225-230.
[11] Dabbs, B.; Junker, B., Comparison of cross-validation methods for stochastic block models (2016), arXiv:1605.03000
[12] Dhillon, I.S., 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 269-274.
[13] Ding, X.; Yang, F., A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices, Ann. Probab., 28, 3, 1679-1738 (2018) · Zbl 1426.15052
[14] Dua, D.; Graff, C., UCI Machine Learning Repository (2017), University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml
[15] Flynn, C. J.; Perry, P. O., Profile likelihood biclustering, Electron. J. Stat., 14, 1, 731-768 (2020) · Zbl 1435.62229
[16] Geman, S., A limit theorem for the norm of random matrices, Ann. Probab., 8, 2, 252-261 (1980) · Zbl 0428.60039
[17] Govaert, G.; Nadif, M., Clustering with block mixture models, Pattern Recognit., 36, 463-473 (2003)
[18] Hartigan, J. A., Direct clustering of a data matrix, J. Amer. Statist. Assoc., 67, 337, 123-129 (1972)
[19] Holland, P. W.; Laskey, K. B.; Leinhardt, S., Stochastic blockmodels: First steps, Social Networks, 5, 109-137 (1983)
[20] Hu, J.; Qin, H.; Yan, T.; Zhao, Y., Corrected Bayesian information criterion for stochastic block models, J. Amer. Statist. Assoc., 1-13 (2019)
[21] Hu, J.; Zhang, J.; Qin, H.; Yan, T.; Zhu, J., Using maximum entry-wise deviation to test the goodness of fit for stochastic block models, J. Amer. Statist. Assoc., 1-10 (2020)
[22] Johansson, K., Shape fluctuations and random matrices, Comm. Math. Phys., 209, 437-476 (2000) · Zbl 0969.15008
[23] Johnstone, I. M., On the distribution of the largest eigenvalue in principal components analysis, Ann. Statist., 29, 2, 295-327 (2001) · Zbl 1016.62078
[24] Karwa, V.; Pati, D.; Petrović, S.; Solus, L.; Alexeev, N.; Raič, M.; Wilburne, D.; Williams, R.; Yan, B., Exact tests for stochastic block models (2016), arXiv:1612.06040
[25] Kawamoto, T.; Kabashima, Y., Cross-validation estimate of the number of clusters in a network, Sci. Rep., 7, 3327 (2017)
[26] Keribin, C., Brault, V., Celeux, G., Govaert, G., 2012. Model selection for the binary latent block model. In: Proceedings of 20th International Conference on Computational Statistics. pp. 379-390. · Zbl 1331.62149
[27] Keribin, C.; Brault, V.; Celeux, G.; Govaert, G., Estimation and selection for the latent block model on categorical data, Stat. Comput., 25, 1201-1216 (2015) · Zbl 1331.62149
[28] Labiod, L., Nadif, M., 2011. Modularity and spectral co-clustering for categorical data. In: Proceedings of the International Conference on Data Mining. pp. 386-392.
[29] Lei, J., A goodness-of-fit test for stochastic block models, Ann. Statist., 44, 1, 401-424 (2016) · Zbl 1331.62283
[30] Li, T.; Levina, E.; Zhu, J., Network cross-validation by edge sampling, Biometrika, 107, 2, 257-276 (2020) · Zbl 1441.62049
[31] Lomet, A., Govaert, G., Grandvalet, Y., 2012. Model selection in block clustering by the integrated classification likelihood. In: Proceedings of 20th International Conference on Computational Statistics. pp. 519-530. · Zbl 1416.62349
[32] Ma, Z., Accuracy of the Tracy-Widom limits for the extreme eigenvalues in white Wishart matrices, Bernoulli, 18, 1, 322-359 (2012) · Zbl 1248.60010
[33] Mariadassou, M.; Matias, C., Convergence of the groups posterior distribution in latent or stochastic block models, Bernoulli, 21, 1, 537-573 (2015) · Zbl 1329.62285
[34] Nakano, M., Ishiguro, K., Kimura, A., Yamada, T., Ueda, N., 2014. Rectangular tiling process. In: Proceedings of the 31st International Conference on Machine Learning. pp. 361-369.
[35] Passino, F. S.; Heard, N. A., Bayesian estimation of the latent dimension and communities in stochastic blockmodels, Stat. Comput., 30, 1291-1307 (2020) · Zbl 1452.62404
[36] Péché, S., Universality results for the largest eigenvalues of some sample covariance matrix ensembles, Probab. Theory Related Fields, 143, 481-516 (2009) · Zbl 1167.62019
[37] Peixoto, T. P., Parsimonious module inference in large networks, Phys. Rev. Lett., 110, Article 148701 pp. (2013)
[38] Pillai, N. S.; Yin, J., Universality of covariance matrices, Ann. Appl. Probab., 24, 3, 935-1001 (2014) · Zbl 1296.15021
[39] Pontes, B.; Giráldez, R.; Aguilar-Ruiz, J. S., Biclustering on expression data: A review, J. Biomed. Inform., 57, 163-180 (2015)
[40] Rastelli, R.; Friel, N., Optimal Bayesian estimators for latent variable cluster models, Stat. Comput., 28, 1169-1186 (2018) · Zbl 1430.62140
[41] Robert, V.; Vasseur, Y., Comparing high dimensional partitions, with the coclustering adjusted rand index (2017), arXiv:1705.06760
[42] Roy, D. M.; Teh, Y. W., The Mondrian process, (Advances in Neural Information Processing Systems 21 (2008)), 1377-1384
[43] Saldaña, D. F.; Yu, Y.; Feng, Y., How many communities are there?, J. Comput. Graph. Statist., 26, 1, 171-181 (2017)
[44] Shan, H., Banerjee, A., 2008. Bayesian co-clustering. In: Proceedings of the 8th IEEE International Conference on Data Mining. pp. 530-539.
[45] Silverstein, J. W., The smallest eigenvalue of a large dimensional Wishart matrix, Ann. Probab., 13, 4, 1364-1368 (1985) · Zbl 0591.60025
[46] Soshnikov, A., A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices, J. Stat. Phys., 108, 1033-1056 (2002) · Zbl 1018.62042
[47] Tracy, C. A.; Widom, H., The distributions of random matrix theory and their applications, (New Trends in Mathematical Physics (2009), Springer), 753-765 · Zbl 1176.15046
[48] van der Vaart, A. W., Asymptotic Statistics (1998), Cambridge University Press · Zbl 0910.62001
[49] Ward, J. H., Hierarchical grouping to optimize an objective function, J. Amer. Statist. Assoc., 58, 301, 236-244 (1963)
[50] Wyse, J.; Friel, N., Block clustering with collapsed latent block models, Stat. Comput., 22, 415-428 (2012) · Zbl 1322.62046
[51] Wyse, J.; Friel, N.; Latouche, P., Inferring structure in bipartite networks using the latent blockmodel and exact ICL, Netw. Sci., 5, 1, 45-69 (2017)
[52] Yin, Y. Q.; Bai, Z. D.; Krishnaiah, P. R., On the limit of the largest eigenvalue of the large dimensional sample covariance matrix, Probab. Theory Related Fields, 78, 509-521 (1988) · Zbl 0627.62022
[53] Yuan, M.; Feng, Y.; Shang, Z., A likelihood-ratio type test for stochastic block models with bounded degrees (2018), arXiv:1807.04426
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.