zbMATH — the first resource for mathematics

An empirical study of the maximal and total information coefficients and leading measures of dependence. (English) Zbl 1393.62094
Summary: In exploratory data analysis, we are often interested in identifying promising pairwise associations for further analysis while filtering out weaker ones. This can be accomplished by computing a measure of dependence on all variable pairs and examining the highest-scoring pairs, provided the measure of dependence used assigns similar scores to equally noisy relationships of different types. This property, called equitability and previously formalized, can be used to assess measures of dependence along with the power of their corresponding independence tests and their runtime.
Here we present an empirical evaluation of the equitability, power against independence, and runtime of several leading measures of dependence. These include the two recently introduced and simultaneously computable statistics \({\mathrm{MIC}_{e}}\), whose goal is equitability, and \({\mathrm{TIC}_{e}}\), whose goal is power against independence.
Regarding equitability, our analysis finds that \({\mathrm{MIC}_{e}}\) is the most equitable method on functional relationships in most of the settings we considered. Regarding power against independence, we find that \({\mathrm{TIC}_{e}}\) and R. Heller et al.’s [Biometrika 100, No. 2, 503–510 (2013; Zbl 1284.62332)] \({S^{\mathrm{DDP}}}\) share state-of-the-art performance, with several other methods achieving excellent power as well. Our analyses also show evidence for a trade-off between power against independence and equitability consistent with recent theoretical work. Our results suggest that a fast and useful strategy for achieving a combination of power against independence and equitability is to filter relationships by \({\mathrm{TIC}_{e}}\) and then to rank the remaining ones using \({\mathrm{MIC}_{e}}\). We confirm our findings on a set of data collected by the World Health Organization.
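The filter-then-rank strategy described in the summary can be sketched generically as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the threshold are hypothetical, and the absolute Pearson correlation is used purely as a stand-in for both scores, since computing \({\mathrm{TIC}_{e}}\) and \({\mathrm{MIC}_{e}}\) themselves is not shown here.

```python
import math
from itertools import combinations


def abs_pearson(x, y):
    """Stand-in score only: |Pearson r|. This is NOT MIC_e or TIC_e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def filter_then_rank(data, filter_score, rank_score, threshold):
    """Screen all variable pairs with one score, then rank survivors by another.

    data: dict mapping variable name -> list of observations (equal lengths).
    filter_score, rank_score: callables (x, y) -> float; hypothetical
        placeholders for a power-oriented and an equitability-oriented statistic.
    threshold: hypothetical cutoff; in practice it would come from an
        independence test, which is omitted from this sketch.
    """
    survivors = []
    for a, b in combinations(sorted(data), 2):
        if filter_score(data[a], data[b]) >= threshold:
            survivors.append((a, b, rank_score(data[a], data[b])))
    # Highest-scoring pairs first.
    return sorted(survivors, key=lambda t: t[2], reverse=True)


# Toy data (made up for illustration): y is roughly 2*x, z is only weakly related.
data = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.9],
    "z": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranked = filter_then_rank(data, abs_pearson, abs_pearson, threshold=0.9)
for a, b, score in ranked:
    print(a, b, round(score, 3))
```

Passing the two scores as separate callables keeps the screening step and the ranking step independent, so either placeholder could be swapped for a real implementation of the corresponding statistic.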

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H20 Measures of association (correlation, canonical correlation, etc.)
62G10 Nonparametric hypothesis testing
Citations:
Zbl 1284.62332
Software:
HHG
References:
[1] Algeo, T. J. and Lyons, T. W. (2006). Mo – total organic carbon covariation in modern anoxic marine environments: Implications for analysis of paleoredox and paleohydrographic conditions. Paleoceanography 21 PA1016.
[2] Breiman, L. and Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-598. · Zbl 0594.62044
[3] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Advanced Books and Software, Belmont, CA. · Zbl 0541.62042
[4] Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., McClay, J., Mill, J., Martin, J., Braithwaite, A. and Poulton, R. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science 301 386-389.
[5] Clayton, R. N. and Mayeda, T. K. (1996). Oxygen isotope studies of achondrites. Geochim. Cosmochim. Acta 60 1999-2017.
[6] Ding, A. A. and Li, Y. (2013). Copula correlation: An equitable dependence measure and extension of Pearson’s correlation. Preprint. Available at arXiv:1312.7214.
[7] Emilsson, V., Thorleifsson, G., Zhang, B., Leonardson, A. S., Zink, F., Zhu, J., Carlson, S., Helgason, A., Bragi Walters, G., Gunnarsdottir, S. et al. (2008). Genetics of gene expression and its effect on disease. Nature 452 423-428.
[8] Gill, T. et al. (2002). Obesity in the pacific: Too big to ignore. World Health Organization Regional Office for the Western Pacific, Secretariat of the Pacific Community.
[9] Gorfine, M., Heller, R. and Heller, Y. (2012). Comment on “Detecting novel associations in large data sets.” Unpublished. Available at http://www.math.tau.ac.il/ ruheller/Papers/science6.pdf. · Zbl 1348.62162
[10] Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory 63-77. Springer, Berlin. · Zbl 1168.62354
[11] Gretton, A., Fukumizu, K., Teo, C. H., Le, S., Schölkopf, B. and Smola, A. J. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems 585-592.
[12] Heller, R., Heller, Y. and Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika 100 503-510. · Zbl 1284.62332
[13] Heller, R., Heller, Y., Kaufman, S., Brill, B. and Gorfine, M. (2016). Consistent distribution-free \(k\)-sample and independence tests for univariate random variables. J. Mach. Learn. Res. 17 1-54. · Zbl 1360.62217
[14] Hoeffding, W. (1948). A non-parametric test of independence. Ann. Math. Stat. 546-557. · Zbl 0032.42001
[15] Huo, X. and Szekely, G. J. (2014). Fast computing for distance covariance. Preprint. Available at arXiv:1410.1503.
[16] Jaakkola, T. S. and Haussler, D. (1999). Probabilistic kernel regression models. In AISTATS.
[17] Jiang, B., Ye, C. and Liu, J. S. (2015). Nonparametric k-sample tests via dynamic slicing. J. Amer. Statist. Assoc. 110 642-653. · Zbl 1373.62195
[18] Kinney, J. B. and Atwal, G. S. (2014). Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA 111 3354-3359. · Zbl 1359.62213
[19] Kraskov, A., Stogbauer, H. and Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E 69 066138.
[20] Linfoot, E. H. (1957). An informational measure of correlation. Inf. Control 1 85-89. · Zbl 0080.36001
[21] Lopez-Paz, D., Hennig, P. and Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems 1-9.
[22] Moon, Y.-I., Rajagopalan, B. and Lall, U. (1995). Estimation of mutual information using kernel density estimators. Phys. Rev. E 52 2318-2321.
[23] Murrell, B., Murrell, D. and Murrell, H. (2014). R2-equitability is satisfiable. Proc. Natl. Acad. Sci. USA 111 E2160-E2160. Available at http://www.pnas.org/content/early/2014/04/29/1403623111.
[24] Paninski, L. (2003). Estimation of entropy and mutual information. Neural Comput. 15 1191-1253. · Zbl 1052.62003
[25] Rényi, A. (1959). On measures of dependence. Acta Math. Hungar. 10 441-451.
[26] Reshef, D. N., Reshef, Y. A., Sabeti, P. C. and Mitzenmacher, M. (2018a). Appendix to “An empirical study of the maximal and total information coefficients and leading measures of dependence.” DOI:10.1214/17-AOAS1093SUPPA. · Zbl 1393.62094
[27] Reshef, D. N., Reshef, Y. A., Sabeti, P. C. and Mitzenmacher, M. (2018b). Supplement to “An empirical study of the maximal and total information coefficients and leading measures of dependence.” DOI:10.1214/17-AOAS1093SUPPB. · Zbl 1393.62094
[28] Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M. and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science 334 1518-1524. · Zbl 1359.62216
[29] Reshef, D., Reshef, Y., Mitzenmacher, M. and Sabeti, P. (2013). Equitability analysis of the maximal information coefficient, with comparisons. Preprint. Available at arXiv:1301.6314. · Zbl 1393.62094
[30] Reshef, D. N., Reshef, Y. A., Mitzenmacher, M. and Sabeti, P. C. (2014). Cleaning up the record on the maximal information coefficient and equitability. Proc. Natl. Acad. Sci. USA 111 E3362-E3363. Available at http://www.pnas.org/content/early/2014/08/07/1408920111. · Zbl 1393.62094
[31] Reshef, Y. A., Reshef, D. N., Sabeti, P. C. and Mitzenmacher, M. (2015). Equitability, interval estimation, and statistical power. Available at arXiv:1505.02212. · Zbl 1436.62032
[32] Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C. and Mitzenmacher, M. (2016). Measuring dependence powerfully and equitably. J. Mach. Learn. Res. 17 Paper No. 212, 63. · Zbl 1436.62032
[33] Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41 2263-2291. · Zbl 1281.62117
[34] Simon, N. and Tibshirani, R. (2012). Comment on “Detecting novel associations in large data sets”. Unpublished. Available at http://statweb.stanford.edu/tibs/reshef/comment.pdf.
[35] Speed, T. (2011). A correlation for the 21st century. Science 334 1502-1503.
[36] Szekely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat. 3 1236-1265. · Zbl 1196.62077
[37] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[38] Wang, X., Jiang, B. and Liu, J. S. (2017). Generalized R-squared for detecting dependence. Biometrika 104 129-139. · Zbl 07072186
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.