On the relation between the true and sample correlations under Bayesian modelling of gene expression datasets. (English) Zbl 1398.92119

Summary: Predicting cancer prognosis and metastatic potential immediately after the initial diagnosis is a major challenge in current clinical research. The relevance of such a prognostic signature is clear, as it would spare many patients the agony and toxic side effects of adjuvant chemotherapy that is automatically, and sometimes carelessly, prescribed to them. Motivated by this issue, several previous works presented a Bayesian model which led to the following conclusion: thousands of samples are needed to generate a robust gene list for predicting outcome. This conclusion rests on several statistical assumptions, including asymptotic independence of the sample correlations. The current work makes two main contributions: (1) It shows that while the assumptions of the Bayesian model discussed in previous papers seem non-restrictive, they are in fact quite strong. To demonstrate this point, it is shown that some standard sparse and Gaussian models are not included in the set of models that are mathematically consistent with these assumptions. (2) It shows that the empirical Bayes methodology that was applied to test the relevant assumptions does not detect severe violations, and consequently the required sample size might be overestimated. Finally, we suggest that, under some regularity conditions, the current theoretical results could be used to develop a new method for testing the asymptotic independence assumption.

MSC:

92C50 Medical applications (general)
92C40 Biochemistry, molecular biology
62F15 Bayesian inference
62P10 Applications of statistics to biology and medical sciences; meta analysis
92B15 General biostatistics

Software:

tmg; HdBCS

References:

[1] Alam, K. (1979): “Distribution of sample correlation coefficients.” Nav. Res. Logist., 26, 327-330. · Zbl 0401.62041
[2] Alam, K., M. H. Rizvi and H. Solomon (1976): “Selection of largest multiple correlation coefficients: exact sample size case.” Ann. Stat., 4, 614-620. · Zbl 0329.62024
[3] Carvalho, C., J. Chang, J. Lucas, J. Nevins, Q. Wang and M. West (2008): “High-dimensional sparse factor modeling: applications in gene expression genomics.” J. Am. Stat. Assoc., 103, 1438-1456. · Zbl 1286.62091
[4] Cui, X. and J. Wilson (2008): “On the probability of correct selection for large k populations with application to microarray data.” Biometrical J., 50, 833-870.
[5] Cui, X., H. Zhao and J. Wilson (2010): “Optimized ranking and selection methods for feature selection with application in microarray experiments.” J. Biopharm. Stat., 20, 223-239.
[6] Dobra, A., C. Hans, B. Jones, J. Nevins, G. Yao and M. West (2004): “Sparse graphical models for exploring gene expression data.” J. Multivariate Anal., 90, 196-212. · Zbl 1047.62104
[7] Donoho, D. (2000): High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture.
[8] Ein-Dor, L., O. Zuk and E. Domany (2006): “Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.” Proc. Natl. Acad. Sci. USA, 103, 5923-5928.
[9] Ferguson, T. (1996): A course in large sample theory, Chapman and Hall, London. · Zbl 0871.62002
[10] Fisher, R. (1915): “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population.” Biometrika, 10, 507-521.
[11] Fisher, R. (1921): “On the probable error of a coefficient of correlation deduced from a small sample.” Metron, 1, 3-32.
[12] Guyon, I. and A. Elisseeff (2003): “An introduction to variable and feature selection.” J. Mach. Learn. Res., 3, 1157-1182. · Zbl 1102.68556
[13] Hall, M. (1998): Correlation based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
[14] Isserlis, L. (1918): “On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables.” Biometrika, 12, 134-139.
[15] Jacobovic, R. and O. Zuk (2017): “On the asymptotic efficiency of selection procedures for independent Gaussian populations.” Electron. J. Stat., 11, 5375-5405. · Zbl 1387.62027
[16] Knowles, D. and Z. Ghahramani (2011): “Nonparametric Bayesian sparse factor models with application to gene expression modeling.” Ann. Appl. Stat., 5, 1534-1552. · Zbl 1223.62013
[17] Levy, K. (1975): “Selecting the best population from among k binomial populations or the population with the largest correlation coefficient from among k bivariate normal populations.” Psychometrika, 40, 121-122. · Zbl 0319.62014
[18] Levy, K. (1977): “Appropriate sample sizes for selecting a population with the largest correlation coefficient from among k bivariate normal populations.” Educ. Psychol. Meas., 37, 61-66.
[19] McDowell, I. C., D. Manandhar, C. Vockley, A. Schmid and T. Reddy (2018): “Clustering gene expression time series data using an infinite Gaussian process mixture model.” PLoS Comput. Biol., 14, e1005896.
[20] Pakman, A. and L. Paninski (2014): “Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians.” J. Comput. Graph. Stat., 23, 518-542.
[21] Ramberg, J. (1977): “Selecting the best predictor variate.” Commun. Stat. Theory Methods, 11, 1133-1147. · Zbl 0375.62029
[22] Rizvi, M. H. and H. Solomon (1973): “Selection of largest multiple correlation coefficients: asymptotic case.” J. Am. Stat. Assoc., 68, 184-188. · Zbl 0262.62028
[23] Spiegel, M. R. (1968): Mathematical handbook of formulas and tables. Schaum.
[24] Wilcox, R. (1978): “Some comments on selecting the best of several binomial populations or the bivariate normal population having the largest correlation coefficient.” Psychometrika, 43, 127-128.
[25] Yeung, K., C. Fraley, A. Murua, A. Raftery and W. Ruzzo (2001): “Model-based clustering and data transformations for gene expression data.” Bioinformatics, 17, 977-987.
[26] Yu, L. and H. Liu (2003): “Feature selection for high-dimensional data: a fast correlation-based filter solution.” Proceedings of the Twentieth International Conference on Machine Learning, 856-863.
[27] Zuk, O., L. Ein-Dor and E. Domany (2007): “Ranking under uncertainty.” UAI, 466-473.