Robust and computationally feasible community detection in the presence of arbitrary outlier nodes.

*(English)*Zbl 1328.62381Summary: Community detection, which aims to cluster \(N\) nodes in a given graph into \(r\) distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by \(k\)-means algorithm with \(k=r\).{

}Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with \(N\). This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors’ knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

}Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with \(N\). This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors’ knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

##### MSC:

62H30 | Classification and discrimination; cluster analysis (statistical aspects) |

91C20 | Clustering in the social and behavioral sciences |

##### Software:

PhaseLift##### References:

[1] | Adamic, A. and Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3 rd International Workshop on Link Discovery 36-43. ACM, New York. |

[2] | Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory 48 569-579. · Zbl 1071.94530 |

[3] | Airoldi, E., Blei, M., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9 1981-2014. · Zbl 1225.68143 |

[4] | Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Math. Program. 147 429-465. · Zbl 1297.90107 |

[5] | Ames, B. P. W. and Vavasis, S. A. (2014). Convex optimization for the planted \(k\)-disjoint-clique problem. Math. Program. 143 299-337. · Zbl 1291.90194 |

[6] | Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097-2122. · Zbl 1277.62166 |

[7] | Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering (NIPS 2011). Adv. Neural Inf. Process. Syst. 25 954-962. |

[8] | Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance. Available at . arXiv:1401.3915 |

[9] | Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA 106 21068-21073. · Zbl 1359.62411 |

[10] | Bickel, P., Choi, D., Chang, X. and Zhang, H. (2013). Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41 1922-1943. · Zbl 1292.62042 |

[11] | Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Faund. Trends Mach. Learn. 3 1-122. · Zbl 1229.90122 |

[12] | Cai, T. and Li, X. (2015). Supplement to “Robust and computationally feasible community detection in the presence of arbitrary outlier nodes.” . · Zbl 1328.62381 |

[13] | Candès, E. J., Strohmer, T. and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66 1241-1274. · Zbl 1335.94013 |

[14] | Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. 11, 37. · Zbl 1327.62369 |

[15] | Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6 1847-1899. · Zbl 1295.62028 |

[16] | Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. J. Mach. Learn. Res. 23 35.1-35.23. |

[17] | Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. Adv. Neural Inf. Process. Syst. 25 2213-2221. |

[18] | Chernoff, H. (1981). A note on an inequality involving the normal distribution. Ann. Probab. 9 533-535. · Zbl 0457.60014 |

[19] | Clauset, A., Newman, M. and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E 70 066111. |

[20] | Coja-Oghlan, A. and Lanka, A. (2009/10). Finding planted partitions in random graphs with general degree distributions. SIAM J. Discrete Math. 23 1682-1714. · Zbl 1207.05178 |

[21] | Decelle, A., Krzakala, F., Moore, C. and Zdeborová, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84 066106. |

[22] | Deshpande, Y. and Montanari, A. (2015). Finding hidden cliques of size \(\sqrt{N/e}\) in nearly linear time. Found. Comput. Math. . · Zbl 1347.05227 |

[23] | Fienberg, S. E. (2010). Introduction to papers on the modeling and analysis of network data. Ann. Appl. Stat. 4 1-4. · Zbl 05782530 |

[24] | Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 21 825-839. |

[25] | Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl. 34 23-39. · Zbl 1314.05186 |

[26] | Füredi, Z. and Komlós, J. (1981). The eigenvalues of random symmetric matrices. Combinatorica 1 233-241. · Zbl 0494.15010 |

[27] | Giesen, J. and Mitsche, D. (2005). Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory. Lecture Notes in Computer Science 3623 433-444. Springer, Berlin. · Zbl 1122.05079 |

[28] | Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning 2 129-233. · Zbl 1184.68030 |

[29] | Handcock, M. S., Raftery, A. E. and Tantrum, J. M. (2007). Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A 170 301-354. · Zbl 05273954 |

[30] | Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw. 5 109-137. |

[31] | Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis , 2nd ed. Cambridge Univ. Press, Cambridge. · Zbl 1267.15001 |

[32] | Jalali, A., Chen, Y., Sanghavi, S. and Xu, H. (2014). Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15 2213-2238. · Zbl 1319.62123 |

[33] | Jin, J. (2015). Fast network community detection by SCORE. Ann. Statist. 43 57-89. · Zbl 1310.62076 |

[34] | Joseph, A. and Yu, B. (2013). Impact of regularization on spectral clustering. Available at . arXiv:1312.1733 |

[35] | Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3) 83 016107, 10. |

[36] | Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborová, L. and Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA 110 20935-20940. · Zbl 1359.62252 |

[37] | Kumar, A., Sabharwal, Y. and Sen, S. (2011). A simple linear time \((1+\epsilon)\)-approximation algorithm for \(k\)-means clustering in any dimensions. J. ACM 58 11. |

[38] | Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Ann. Statist. 43 215-237. · Zbl 1308.62041 |

[39] | Li, X. and Voroninski, V. (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 45 3019-3033. · Zbl 1320.94023 |

[40] | Lin, Z., Liu, R. and Su, Z. (2011). Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems ( NIPS ) 612-620. |

[41] | Mathieu, C. and Schudy, W. (2010). Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 712-728. SIAM, Philadelphia, PA. · Zbl 1288.68197 |

[42] | McSherry, F. (2001). Spectral partitioning of random graphs. In 42 nd IEEE Symposium on Foundations of Computer Science ( Las Vegas , NV , 2001) 529-537. IEEE Computer Soc., Los Alamitos, CA. |

[43] | Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69 026113. · Zbl 1032.91716 |

[44] | Newman, M. and Leicht, E. (2007). Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104 9564-9569. · Zbl 1155.91026 |

[45] | Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 96 1077-1087. · Zbl 1072.62542 |

[46] | Oymak, S. and Hassibi, B. (2011). Finding dense clusters via low rank \(+\) sparse decomposition. Available at . arXiv:1104.5186 |

[47] | Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39 1878-1915. · Zbl 1227.62042 |

[48] | Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Available at . arXiv:1310.1495 · Zbl 1320.62150 |

[49] | Shamir, R. and Tsur, D. (2007). Improved algorithms for the random cluster graph model. Random Structures Algorithms 31 418-449. · Zbl 1129.05049 |

[50] | Snijders, T. A. B. and Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification 14 75-100. · Zbl 0896.62063 |

[51] | Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107 1119-1128. · Zbl 1443.62188 |

[52] | Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 389-434. · Zbl 1259.60008 |

[53] | Vu, V. H. (2007). Spectral norm of random matrices. Combinatorica 27 721-736. · Zbl 1164.05066 |

[54] | Vu, V. (2014). A simple SVD algorithm for finding hidden partitions. Available at . arXiv:1404.3918 |

[55] | Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40 2266-2292. · Zbl 1257.62095 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.