Reliable clustering of Bernoulli mixture models. (English) Zbl 1466.62357

Summary: A Bernoulli Mixture Model (BMM) is a finite mixture of random binary vectors whose dimensions are independent within each mixture component. The problem of clustering BMM data arises in a variety of real-world applications, ranging from population genetics to activity analysis in social networks. In this paper, we analyze the clusterability of BMMs from a theoretical perspective when the number of clusters is unknown. In particular, we stipulate a set of conditions on the sample complexity and dimension of the model that guarantee the Probably Approximately Correct (PAC)-clusterability of a dataset. To the best of our knowledge, these findings are the first non-asymptotic bounds on the sample complexity of learning or clustering BMMs.
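The generative model described above (a mixture of binary vectors with independent dimensions given the component) can be sketched as follows; the function name, the parameter values, and the two-cluster example are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_bmm(n, weights, theta, seed=None):
    """Draw n binary vectors from a Bernoulli mixture model (illustrative sketch).

    weights: (K,) mixing proportions summing to 1.
    theta:   (K, d) Bernoulli parameters; given the latent cluster,
             the d dimensions are independent.
    Returns the (n, d) binary data matrix and the latent cluster labels.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(weights), size=n, p=weights)       # latent cluster labels
    x = (rng.random((n, theta.shape[1])) < theta[z]).astype(int)
    return x, z

# Hypothetical example: K = 2 well-separated clusters in d = 5 dimensions.
weights = np.array([0.6, 0.4])
theta = np.array([[0.9, 0.9, 0.1, 0.1, 0.5],
                  [0.1, 0.1, 0.9, 0.9, 0.5]])
X, z = sample_bmm(1000, weights, theta, seed=0)
```

A clustering procedure is then judged by how well it recovers the latent labels `z` from `X` alone; the paper's conditions concern how large `n` and `d` must be for this to be possible with high probability.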


62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI arXiv Euclid


[1] Allman, E.S., Matias, C. and Rhodes, J.A. (2009). Identifiability of parameters in latent structure models with many observed variables. Ann. Statist. 37 3099-3132. · Zbl 1191.62003
[2] Ashtiani, H., Ben-David, S., Harvey, N., Liaw, C., Mehrabian, A. and Plan, Y. (2018). Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. In Advances in Neural Information Processing Systems 3412-3421. · Zbl 1499.68298
[3] Baker, L.D. and McCallum, A.K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 96-103. ACM.
[4] Balakrishnan, S., Wainwright, M.J. and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist. 45 77-120. · Zbl 1367.62052
[5] Biernacki, C., Celeux, G. and Govaert, G. (1999). An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recogn. Lett. 20 267-272. · Zbl 0933.68117
[6] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer. · Zbl 1107.68072
[7] Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Comput. Statist. Data Anal. 71 52-78. · Zbl 1471.62032
[8] Carreira-Perpinán, M.A. and Renals, S. (2000). Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Comput. 12 141-152.
[9] Catchen, J., Hohenlohe, P.A., Bassham, S., Amores, A. and Cresko, W.A. (2013). Stacks: An analysis tool set for population genomics. Mol. Ecol. 22 3124-3140.
[10] Celeux, G. and Soromenho, G. (1996). An entropy criterion for assessing the number of clusters in a mixture model. J. Classification 13 195-212. · Zbl 0861.62051
[11] Chan, S.-O., Diakonikolas, I., Servedio, R.A. and Sun, X. (2014). Efficient density estimation via piecewise polynomial approximation. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing 604-613. ACM. · Zbl 1315.68163
[12] Courant, R. (2011). Differential and Integral Calculus. Vol. II. Wiley Classics Library. New York: Wiley. · Zbl 1245.26001
[13] Cover, T.M. and Thomas, J.A. (2012). Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley Interscience. · Zbl 0762.94001
[14] Diakonikolas, I. (2016). Learning structured distributions. In Handbook of Big Data. Chapman & Hall/CRC Handb. Mod. Stat. Methods 267-283. Boca Raton, FL: CRC Press.
[15] Evanno, G., Regnaut, S. and Goudet, J. (2005). Detecting the number of clusters of individuals using the software structure: A simulation study. Mol. Ecol. 14 2611-2620.
[16] Falush, D., Stephens, M. and Pritchard, J.K. (2003). Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164 1567-1587.
[17] Figueiredo, M.A.T. and Jain, A.K. (2002). Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 24 381-396.
[18] Fjellstad, O.-E. and Fossen, T.I. (2016). A generalized multivariate logistic model and EM algorithm based on the normal variance mean mixture representation. In Statistical Signal Processing Workshop (SSP) 1-5. IEEE.
[19] Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611-631. · Zbl 1073.62545
[20] Gershman, S.J. and Blei, D.M. (2012). A tutorial on Bayesian nonparametric models. J. Math. Psych. 56 1-12. · Zbl 1237.62062
[21] Gyllenberg, M., Koski, T., Reilink, E. and Verlaan, M. (1994). Nonuniqueness in probabilistic numerical identification of bacteria. J. Appl. Probab. 31 542-548. · Zbl 0817.92002
[22] Hollander, M., Wolfe, D.A. and Chicken, E. (2014). Nonparametric Statistical Methods, 3rd ed. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley. · Zbl 1279.62006
[23] Juan, A., García-Hernández, J. and Vidal, E. (2004). EM initialisation for Bernoulli mixture learning. In Structural, Syntactic, and Statistical Pattern Recognition 635-643. · Zbl 1104.68626
[24] Juan, A. and Vidal, E. (2004). Bernoulli mixture models for binary images. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) 3 367-370. IEEE.
[25] Kalai, A.T., Moitra, A. and Valiant, G. (2012). Disentangling Gaussians. Commun. ACM 55 113-120.
[26] Kopelman, N.M., Mayzel, J., Jakobsson, M., Rosenberg, N.A. and Mayrose, I. (2015). Clumpak: A program for identifying clustering modes and packaging population structure inferences across K. Mol. Ecol. Resour. 15 1179-1191.
[27] Lazarsfeld, P.F., Henry, N.W. and Anderson, T.W. (1968). Latent Structure Analysis 109. Boston, MA: Houghton Mifflin.
[28] Li, C., Wang, B., Pavlu, V. and Aslam, J. (2016). Conditional Bernoulli mixtures for multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning 2482-2491.
[29] McLachlan, G. and Peel, D. (2004). Finite Mixture Models. Wiley Series in Probability and Statistics: Applied Probability and Statistics. New York: Wiley Interscience.
[30] McNicholas, P.D. (2016). Model-based clustering. J. Classification 33 331-373. · Zbl 1364.62155
[31] Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012). Foundations of Machine Learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press. · Zbl 1318.68003
[32] Müller, P., Quintana, F.A., Jara, A. and Hanson, T. (2015). Bayesian Nonparametric Data Analysis. Springer Series in Statistics. Cham: Springer. · Zbl 1333.62003
[33] Najafi, A., Janghorbani, S., Motahari, S.A. and Fatemizadeh, E. (2019). Statistical association mapping of population-structured genetic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 16 638-649.
[34] Orbanz, P. and Teh, Y.W. (2011). Bayesian nonparametric models. In Encyclopedia of Machine Learning 81-89. Springer.
[35] Peakall, R. and Smouse, P.E. (2006). GENALEX 6: Genetic analysis in Excel. Population genetic software for teaching and research. Mol. Ecol. Notes 6 288-295.
[36] Pella, J. and Masuda, M. (2006). The Gibbs and split merge sampler for population mixture analysis from genetic data with incomplete baselines. Can. J. Fish. Aquat. Sci. 63 576-596.
[37] Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 904-909.
[38] Pritchard, J.K., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155 945-959.
[39] Pritchard, J.K., Stephens, M., Rosenberg, N.A. and Donnelly, P. (2000). Association mapping in structured populations. Am. J. Hum. Genet. 67 170-181.
[40] Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., De Bakker, P.I., Daly, M.J. et al. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81 559-575.
[41] Rousseau, J. (2016). On the frequentist properties of Bayesian nonparametric methods. Annu. Rev. Stat. Appl. 3 211-231.
[42] Studený, M. and Vejnarová, J. (1998). The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models 261-297. Springer. · Zbl 0917.60013
[43] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 1385-1392.
[44] Tiedeman, D. (1955). On the study of types. In Symposium on Pattern Analysis 1-14.
[45] Visscher, P.M., Brown, M.A., McCarthy, M.I. and Yang, J. (2012). Five years of GWAS discovery. Am. J. Hum. Genet. 90 7-24.
[46] Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM J. Res. Develop. 4 66-82. · Zbl 0097.35003
[47] Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5 329-350.
[48] Yu, J., Pressoir, G., Briggs, W.H., Bi, I.V., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen, D.M., Holland, J.B. et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38 203-208.
[49] Zhou, H., Blangero, J., Dyer, T.D., Chan, K.K., Lange, K. and Sobel, E.M. (2017). Fast genome-wide QTL association mapping on pedigree and population data. Genet. Epidemiol. 41 174-186.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.