×

A biclustering algorithm for binary matrices based on penalized Bernoulli likelihood. (English) Zbl 1325.62013

Summary: We propose a new biclustering method for binary data matrices using the maximum penalized Bernoulli likelihood estimation. Our method applies a multi-layer model defined on the logits of the success probabilities, where each layer represents a simple bicluster structure and the combination of multiple layers is able to reveal complicated, multiple biclusters. The method allows for non-pure biclusters, and can simultaneously identify the 1-prevalent blocks and 0-prevalent blocks. A computationally efficient algorithm is developed and guidelines are provided for specifying the tuning parameters, including initial values of model parameters, the number of layers, and the penalty parameters. Missing-data imputation can be handled in the EM framework. The method is tested using synthetic and real datasets and shows good performance.

MSC:

62-07 Data analysis (statistics) (MSC2010)
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

EXPANDER; UCI-ml
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In: Proceedings of the 6th Annual International Conference on Computational Biology, pp. 49–57 (2002)
[2] Brookes, A.J.: Review: the essence of SNPs. Gene 234, 177–186 (1999) · doi:10.1016/S0378-1119(99)00219-X
[3] Cheng, Y., Church, G.L.: Biclustering of expression data. In: Proceedings of International Conference on Intelligence Systems for Molecular Biology, pp. 93–103 (2000)
[4] Collins, M., Dasgupta, S., Schapire, R.E.: A generalization of principal component analysis to the exponential family. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advanced in Neural Information Processing System, vol. 14, pp. 617–pages 642. MIT Press, Cambridge (2002)
[5] de Leeuw, J.: Principal component analysis of binary data by iterated singular value decomposition. Comput. Stat. Data Anal. 50, 21–39 (2006) · Zbl 1429.62218 · doi:10.1016/j.csda.2004.07.010
[6] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39, 1–38 (1977) · Zbl 0364.62022
[7] Dhollander, T., Sheng, Q., Lemmens, K., De Moore, B., Marchal, K., Moreau, Y.: Query-driven module discovery in microarray data. Bioinformatics 23, 2573–2580 (2007) · doi:10.1093/bioinformatics/btm387
[8] Ewens, W.J., Spielman, R.S.: The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57, 455–464 (1995) · doi:10.1002/ajmg.1320570319
[9] Frank, A., Asuncion, A.: UCI machine learning repository. http://archive.ics.uci.edu/ml , Irvine, CA: University of California, School of Information and Computer Science (2010)
[10] Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1, 302–332 (2007) · Zbl 1378.90064 · doi:10.1214/07-AOAS131
[11] Gormley, I.C., Murphy, T.B.: A mixture of experts model for rank data with application in election studies. Ann. Appl. Stat. 2, 1452–1477 (2008) · Zbl 1454.62498 · doi:10.1214/08-AOAS178
[12] Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008) · Zbl 1452.62444 · doi:10.1016/j.csda.2007.09.007
[13] Huber, P., Ronchetti, E., victoria-Feser, M.: Estimation of generalized linear latent variable models. J. R. Stat. Soc. B 66, 893–908 (2004) · Zbl 1060.62077 · doi:10.1111/j.1467-9868.2004.05627.x
[14] Hunter, D.R., Li, R.: Variable selection using MM algorithm. Ann. Stat. 33, 1617–1642 (2005) · Zbl 1078.62028 · doi:10.1214/009053605000000200
[15] Ihmels, J., Friedlander, G., Bergman, S., Sarig, O., Ziv, Y., Barkai, N.: Revealing modular organization in the yeast transcriptional network. Nat. Genet. 31, 370–377 (2002)
[16] Jaakkola, T.S., Jordan, M.I.: Bayesian parameter estimation via variational methods. Stat. Comput. 10, 25–37 (2000) · doi:10.1023/A:1008932416310
[17] Kwok, P.Y., Deng, Q., Zakeri, H., Taylor, S.L., Nicerson, D.A.: Increasing the information content of STS-based genome maps: identifying polymorphisms in mapped STSs. Genomics 31, 123–126 (1996) · doi:10.1006/geno.1996.0019
[18] Lange, K., Hunter, D.R., Yang, I.: Optimization transfer using surrogate objective function (with discussion). J. Comput. Graph. Stat. 9, 1–20 (2000)
[19] Lazzeroni, L., Owen, A.B.: Plaid models for gene expression data. Stat. Sin. 12, 61–86 (2002) · Zbl 1004.62084
[20] Lee, M., Shen, H., Huang, J.Z., Marron, J.S.: Biclustering via sparse singular value decomposition. Biometrics 66, 1087–1095 (2010a) · Zbl 1233.62182 · doi:10.1111/j.1541-0420.2010.01392.x
[21] Lee, S., Huang, J.Z., Hu, J.: Sparse logistic principal components analysis for binary data. Ann. Appl. Stat. 4, 1579–1601 (2010b) · Zbl 1202.62084 · doi:10.1214/10-AOAS327
[22] Murali, T.M., Kasif, S.: Extracting conserved gene expression motifs from gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 77–88 (2003) · Zbl 1219.92024
[23] Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22, 1122–1129 (2006) · doi:10.1093/bioinformatics/btl060
[24] Rodriguez-Baena, D.S., Perez-Pulido, A., Aguilar-Ruiz, J.S.: A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics 27, 2746–2753 (2011) · doi:10.1093/bioinformatics/btr464
[25] Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978) · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[26] Serre, D., Montpetit, A., Paré, G., Engert, J.C., Yusuf, S., Keavney, B., Judson, T.J., Anand, S.: Correction of population stratification in large multi-ethnic association studies. PLoS ONE 2, e1382 (2008). doi: 10.1371/journal.pone.0001382
[27] Shamir, R., Maron-Katz, A., Tanay, A., Linhart, C., Steinfeld, I., Sharan, R., Shiloh, Y., Elkon, R.: EXPANDER–an integrative program suite for microarray data analysis. BMC Bioinform. 6, 232 (2005) · doi:10.1186/1471-2105-6-232
[28] Schein, A., Saul, L.K., Ungar, L.H.: A generalized linear model for principal component analysis of binary data. In: Bishop, C.M., Frey, B.J. (eds.) Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pp. 14–21. Key West, FL (2003)
[29] Shen, H., Huang, J.Z.: Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 99, 1015–1034 (2008) · Zbl 1141.62049 · doi:10.1016/j.jmva.2007.06.007
[30] Sheng, Q., Moreau, Y., De Moor, B.: Biclustering microarray data by Gibbs sampling. Bioinformatics 19, 196–205 (2003) · doi:10.1093/bioinformatics/btg1078
[31] Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl. 1), S136–S144 (2002) · doi:10.1093/bioinformatics/18.suppl_1.S136
[32] The International HapMap Consortium: A haplotype map of the human genome. Nature 437, 1299–1320 (2005) · doi:10.1038/nature04226
[33] Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996) · Zbl 0850.62538
[34] Van Uitert, M., Meuleman, W., Wessels, L.: Biclustering sparse binary genomic data. J. Comput. Biol. 15, 1329–1345 (2008) · doi:10.1089/cmb.2008.0066
[35] Wyse, J., Friel, N.: Block clustering with collapsed latent block models. Stat. Comput. 22, 415–428 (2012) · Zbl 1322.62046 · doi:10.1007/s11222-011-9233-4
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.