Bsig: evaluating the statistical significance of biclustering solutions. (English) Zbl 1416.62340

Summary: Statistical evaluation of biclustering solutions is essential to guarantee the absence of spurious relations and to validate the high number of scientific statements inferred from unsupervised data analysis without a proper statistical ground. Most biclustering methods rely on merit functions to discover biclusters with specific homogeneity criteria. However, strong homogeneity does not guarantee the statistical significance of biclustering solutions. Furthermore, although some biclustering methods test the statistical significance of specific types of biclusters, there are no methods to assess the significance of flexible biclustering models. This work proposes a method to evaluate the statistical significance of biclustering solutions. It integrates state-of-the-art statistical views on the significance of local patterns and extends them with new principles to assess the significance of biclusters with additive, multiplicative, symmetric, order-preserving and plaid coherencies. The proposed statistical tests provide the unprecedented possibility to minimize the number of false positive biclusters without incurring on false negatives, and to compare state-of-the-art biclustering algorithms according to the statistical significance of their outputs. Results on synthetic and real data support the soundness and relevance of the proposed contributions, and stress the need to combine significance and homogeneity criteria to guide the search for biclusters.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
62H35 Image analysis in multivariate analysis
Full Text: DOI


[1] Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, ACM, New York, NY, USA, PODS ’98, pp 18-24, doi:10.1145/275487.275490
[2] Alzahrani, M.; Kuwahara, H.; Wang, W.; Gao, X., Gracob: a novel graph-based constant-column biclustering method for mining growth phenotype data, Bioinformatics, (2017)
[3] Balakrishnan S, Kolar M, Rinaldo A, Singh A, Wasserman L (2011) Statistical and computational tradeoffs in biclustering. In: NIPS 2011 workshop on computational trade-offs in statistical learning, vol 4
[4] Barkow, S.; Bleuler, S.; Prelić, A.; Zimmermann, P.; Zitzler, E., Bicat: a biclustering analysis toolbox, Bioinformatics, 22, 1282, (2006)
[5] Bay, SD; Pazzani, MJ, Detecting group differences: mining contrast sets, Data Min Knowl Discov, 5, 213-246, (2001) · Zbl 0982.68048
[6] Bellay, J.; Atluri, G.; Sing, TL; Toufighi, K.; Costanzo, M.; Ribeiro, PSM; Pandey, G.; Baller, J.; VanderSluis, B.; Michaut, M.; Han, S.; Kim, P.; Brown, GW; Andrews, BJ; Boone, C.; Kumar, V.; Myers, CL, Putting genetic interactions in context through a global modular decomposition, Genome Res, 21, 1375-1387, (2011)
[7] Ben-Dor, A.; Chor, B.; Karp, R.; Yakhini, Z., Discovering local structure in gene expression data: the order-preserving submatrix problem, J Comput Biol, 10, 373-384, (2003)
[8] Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society Series B (Methodological), pp 289-300, doi:10.2307/2346101 · Zbl 0809.62014
[9] Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 1165-1188. doi:10.1214/aos/1013699998 · Zbl 1041.62061
[10] Bolton RJ, Hand DJ, Adams NM (2002) Determining hit rate in pattern search. Springer, Berlin, pp 36-48. doi:10.1007/3-540-45728-3_4 · Zbl 1019.68653
[11] Brown, GW, On small-sample estimation, Ann Math Stat, 18, 582-585, (1947) · Zbl 0029.40701
[12] Califano, A.; Stolovitzky, G.; Tu, Y., Analysis of gene expression microarrays for phenotype classification, Int Conf Intell Syst Mol Biol, 8, 75-85, (2000)
[13] Carmona-Saez, P.; Chagoyen, M.; Rodriguez, A.; Trelles, O.; Carazo, JM; Pascual-Montano, A., Integrated analysis of gene expression by association rules discovery, BMC Bioinform, 7, 54, (2006)
[14] Chen, Y.; Xu, J., Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices, J Mach Learn Res, 17, 882-938, (2016) · Zbl 1360.62320
[15] Cheng, Y.; Church, GM, Biclustering of expression data, Intell Syst Mol BiolPress, 8, 93-103, (2000)
[16] DuMouchel, W., Bayesian data mining in large frequency tables, with an application to the fda spontaneous reporting system, Am Stat, 53, 177-190, (1999)
[17] DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’01, pp 67-76, doi:10.1145/502512.502526
[18] Eisen, MB; Spellman, PT; Brown, PO; Botstein, D., Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci, 95, 14,863-14,868, (1998)
[19] Gasch, AP; Spellman, PT; Kao, CM; Carmel-Harel, O.; Eisen, MB; Storz, G.; Botstein, D.; Brown, PO, Genomic expression programs in the response of yeast cells to environmental changes, Mol Biol Cell, 11, 4241-4257, (2000)
[20] Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3). doi:10.1145/1297332.1297338
[21] Gnatyshak D, Ignatov D, Semenov A, Poelmans J (2012) Gaining insight in social networks with biclustering and triclustering. In: Perspectives in business informatics research, LNBIP, vol 128. Springer, Berlin Heidelberg, pp 162-171, doi:10.1007/978-3-642-33281-4_13
[22] Hämälinen W, Nykänen M (2008) Efficient discovery of statistically significant association rules. In: 2008 Eighth IEEE international conference on data mining (ICDM), pp 203-212. doi:10.1109/ICDM.2008.144
[23] Henriques R (2016) Learning from high-dimensional data using local descriptive models. PhD thesis, Instituto Superior Tecnico, Universidade de Lisboa, Lisboa
[24] Henriques, R.; Madeira, S., Bicspam: flexible biclustering using sequential patterns, BMC Bioinform, 15, 130, (2014)
[25] Henriques, R.; Madeira, SC, Bicpam: pattern-based biclustering for biomedical data analysis, Algorithms Mol Biol, 9, 27, (2014)
[26] Henriques, R.; Madeira, SC, Biclustering with flexible plaid models to unravel interactions between biological processes, IEEE/ACM Trans Comput Biol Bioinform (TCBB), 12, 738-752, (2015)
[27] Henriques, R.; Madeira, SC, Bic2pam: constraint-guided biclustering for biological data analysis with domain knowledge, Algorithms Mol Biol, 11, 23, (2016)
[28] Henriques, R.; Madeira, SC, Bicnet: flexible module discovery in large-scale biological networks using biclustering, Algorithms Mol Biol, 11, 1-30, (2016)
[29] Henriques, R.; Antunes, C.; Madeira, SC, A structured view on pattern mining-based biclustering, Pattern Recognit, 48, 3941-3958, (2015)
[30] Henriques, R.; Ferreira, FL; Madeira, SC, Bicpams: software for biological data analysis with pattern-based biclustering, BMC Bioinform, 18, 82, (2017)
[31] Hochreiter, S.; Bodenhofer, U.; Heusel, M.; etal., Fabia: factor analysis for bicluster acquisition, Bioinformatics, 26, 1520-1527, (2010)
[32] Holm, S., A simple sequentially rejective multiple test procedure, Scand J Stat, 6, 65-70, (1979) · Zbl 0402.62058
[33] Huang, DW; Sherman, BT; Lempicki, RA, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res, 37, 1, (2009)
[34] Ihmels, J.; Bergmann, S.; Barkai, N., Defining transcription modules using large-scale gene expression data, Bioinformatics, 20, 1993, (2004)
[35] Jaroszewicz S, Scheffer T (2005) Fast discovery of unexpected patterns in data, relative to a bayesian network. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, ACM, New York, NY, USA, KDD ’05, pp 118-127. doi:10.1145/1081870.1081887
[36] Karian Z, Dudewicz E (2010) Handbook of fitting statistical distributions with R. Taylor & Francis, Milton Park · Zbl 1282.62034
[37] Kirsch, A.; Mitzenmacher, M.; Pietracaprina, A.; Pucci, G.; Upfal, E.; Vandin, F., An efficient rigorous approach for identifying statistically significant frequent itemsets, J ACM, 59, 12:1-12:22, (2012) · Zbl 1281.68098
[38] Koyuturk M, Szpankowski W, Grama A (2004) Biclustering gene-feature matrices for statistically significant dense patterns. In: Proceedings. 2004 IEEE computational systems bioinformatics conference (CSB), pp 480-484. doi:10.1109/CSB.2004.1332467
[39] Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statistica Sinica 12(1):61-86. http://www.jstor.org/stable/24307036 · Zbl 1004.62084
[40] Lee JD, Sun Y, Taylor JE (2015) Evaluating the statistical significance of biclusters. In: Advances in neural information processing systems 28 (NIPS), Curran Associates, Inc., pp 1324-1332
[41] Lee, W.; Tillo, D.; Bray, N.; Morse, RH; Davis, RW; Hughes, TR; Nislow, C., A high-resolution atlas of nucleosome occupancy in yeast, Nat Genet, 39, 1235-1244, (2007)
[42] Madeira, SC; Oliveira, AL, Biclustering algorithms for biological data analysis: a survey, IEEE/ACM Trans Comput Biol Bioinform (TCBB), 1, 24-45, (2004)
[43] Madeira SC, Oliveira AL (2007) An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In: Asia Pacific bioinformatics conference, pp 67-80
[44] Madeira, SC; Teixeira, MC; Sa-Correia, I.; Oliveira, AL, Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm, IEEE/ACM Trans Comput Biol Bioinform (TCBB), 7, 153-165, (2010)
[45] Mahfouz, MA; Ismail, MA, Bidens: Iterative density based biclustering algorithm with application to gene expression analysis, Int J Comput Electr Automa Control Inf Eng, 3, 40-46, (2009)
[46] Mankad, S.; Michailidis, G., Biclustering three-dimensional data arrays with plaid models, J Comput Graph Stat, 23, 943-965, (2014)
[47] Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the fourth international conference on knowledge discovery and data mining, AAAI Press, KDD’98, pp 274-278
[48] Mitra, S.; Banka, H., Multi-objective evolutionary biclustering of gene expression data, Pattern Recognit, 39, 2464-2477, (2006) · Zbl 1103.68775
[49] Noureen N, Kulsoom N, de la Fuente A, Fazal S, Malik SI (2009) Functional and promoter enrichment based analysis of biclustering algorithms using gene expression data of yeast. In: 2009 IEEE 13th international multitopic conference (INMIC), IEEE, pp 1-6, doi:10.1109/INMIC.2009.5383144
[50] Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on data mining (SDM), SIAM, vol 8, pp 494-505. doi:10.1137/1.9781611972788.45
[51] Okada, Y.; Fujibuchi, W.; Horton, P., A biclustering method for gene expression module discovery using closed itemset enumeration algorithm, IPSJ Trans Bioinform, 3, 183-192, (2007)
[52] Pio G, Ceci M, D’Elia D, Loglisci C, Malerba D (2012) A novel biclustering algorithm for the discovery of meaningful biological correlations between mirnas and mrnas. EMBnetjournal 18(A). doi:10.14806/ej.18.A.375
[53] Ramon J, Miettinen P, Vreeken J (2013) Detecting bicliques in gf[q]. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, vol 8188, Springer New York, Inc., New York, NY, USA, ECML PKDD, pp 509-524. doi:10.1007/978-3-642-40988-2_33
[54] Rosenwald, A.; Wright, G.; Chan, WC; Connors, JM; Campo, E.; Fisher, RI; Gascoyne, RD; Muller-Hermelink, HK; Smeland, EB; Giltnane, JM; Hurt, EM; Zhao, H.; Averett, L.; Yang, L.; Wilson, WH; Jaffe, ES; Simon, R.; Klausner, RD; Powell, J.; Duffey, PL; Longo, DL; Greiner, TC; Weisenburger, DD; Sanger, WG; Dave, BJ; Lynch, JC; Vose, J.; Armitage, JO; Montserrat, E.; López-Guillermo, A.; Grogan, TM; Miller, TP; LeBlanc, M.; Ott, G.; Kvaloy, S.; Delabie, J.; Holte, H.; Krajci, P.; Stokke, T.; Staudt, LM, The use of molecular profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma, N Engl J Med, 346, 1937-1947, (2002)
[55] Scheffer, T., Finding association rules that trade support optimally against confidence, Intell Data Anal, 9, 381-395, (2005)
[56] Serin, A.; Vingron, M., Debi: Discovering differentially expressed biclusters using a frequent itemset approach, Algorithms Mol Biol, 6, 1-12, (2011)
[57] Silberschatz, A.; Tuzhilin, A., What makes patterns interesting in knowledge discovery systems, IEEE Trans Knowl Data Eng, 8, 970-974, (1996)
[58] Silverstein, C.; Brin, S.; Motwani, R., Beyond market baskets: generalizing association rules to dependence rules, Data Min Knowl Discov, 2, 39-68, (1998)
[59] Tanay, A.; Sharan, R.; Shamir, R., Discovering statistically significant biclusters in gene expression data, Bioinformatics, 18, s136, (2002)
[60] Tavazoie, S.; Hughes, J.; Campbell, M.; Cho, R.; Church, G., Systematic determination of genetic network architecture, Nature Genet, 22, 281-285, (1999)
[61] Wang H, Wang W, Yang J, Yu PS (2002) Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’02, pp 394-405. doi:10.1145/564691.564737
[62] Webb, GI, Discovering significant patterns, Mach Learn, 68, 1-33, (2007)
[63] Yang J, Wang W, Wang H, Yu P (2002) delta-clusters: capturing subspace correlation in a large data set. In: Proceedings 18th international conference on data engineering (ICDE), IEEE, pp 517-528. doi:10.1109/ICDE.2002.994771
[64] Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’04, pp 374-383. doi:10.1145/1014052.1014094
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.