×

zbMATH — the first resource for mathematics

Efficient regularized isotonic regression with application to gene-gene interaction search. (English) Zbl 1235.62046
Ann. Appl. Stat. 6, No. 1, 253-283 (2012); correction ibid. 9, No. 4, 2266-2267 (2015).
Summary: Isotonic regression is a nonparametric approach for fitting monotonic models to data that has been widely studied from both theoretical and practical perspectives. However, this approach encounters computational and statistical overfitting issues in higher dimensions. To address both concerns, we present an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic regression based on recursively partitioning the covariate space through solution of progressively smaller “best cut” subproblems. This creates a regularized sequence of isotonic models of increasing model complexity that converges to the global isotonic regression solution. The models along the sequence are often more accurate than the unregularized isotonic regression model because of the complexity control they offer. We quantify this complexity control through estimation of degrees of freedom along the path. Success of the regularized models in prediction and IRPs favorable computational properties are demonstrated through a series of simulated and real data experiments. We discuss application of IRP to the problem of searching for gene-gene interactions and epistasis, and demonstrate it on data from genome-wide association studies of three common diseases.

MSC:
62G08 Nonparametric regression and quantile regression
62-08 Computational methods for problems pertaining to statistics
62G05 Nonparametric estimation
92C40 Biochemistry, molecular biology
62P10 Applications of statistics to biology and medical sciences; meta analysis
PDF BibTeX XML Cite
Full Text: DOI Euclid arXiv
References:
[1] Auh, S. and Sampson, A. R. (2006). Isotonic logistic discrimination. Biometrika 93 961-972. · Zbl 1436.62237
[2] Bacchetti, P. (1989). Additive isotonic models. J. Amer. Statist. Assoc. 84 289-294.
[3] Barlow, R. E. and Brunk, H. D. (1972). The isotonic regression problem and its dual. J. Amer. Statist. Assoc. 67 140-147. · Zbl 0236.62050
[4] Block, H., Qian, S. and Sampson, A. (1994). Structure algorithms for partially ordered isotonic regression. J. Comput. Graph. Statist. 3 285-300.
[5] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization . Cambridge Univ. Press, Cambridge. · Zbl 1058.90049
[6] Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A. (1984). Classification and Regression Trees . Chapman and Hall/CRC, Boca Raton, FL. · Zbl 0541.62042
[7] Chandrasekaran, R., Ryu, Y. U., Jacob, V. S. and Hong, S. (2005). Isotonic separation. INFORMS J. Comput. 17 462-474. · Zbl 1241.90157
[8] Cordell, H. J. (2009). Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 10 392-404.
[9] de Leeuw, J., Hornik, K. and Mair, P. (2009). Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Dept. Statistics, UCLA. Available at .
[10] DeLong, E. R., DeLong, D. M. and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44 837-845. · Zbl 0715.62207
[11] Dykstra, R. L. and Robertson, T. (1982). An algorithm for isotonic regression for two or more independent variables. Ann. Statist. 10 708-716. · Zbl 0485.65099
[12] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461-470. · Zbl 0621.62073
[13] Eichler, E. E., Flint, J., Gibson, G., Kong, A., Leal, S. M., Moore, J. H. and Nadeau, J. H. (2010). Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11 446-450.
[14] Emily, M., Mailund, T., Hein, J., Schauser, L. and Schierup, M. H. (2009). Using biological networks to search for interacting loci in genome-wide association studies. Eur. J. Hum. Genet. 17 1231-1240.
[15] Frank, A. and Asuncion, A. (2010). UCI machine learning repository. Auto MPG data set. Available at .
[16] Galil, Z. and Naamad, A. (1980). An O ( EV log 2 V ) algorithm for the maximal flow problem. J. Comput. System Sci. 21 203-217. · Zbl 0449.90094
[17] Gneiting, T. (2011). Making and evaluating point forecasts. J. Amer. Statist. Assoc. 106 746-762. · Zbl 1232.62028
[18] Goldstein, D. B. (2009). Common genetic variation and human traits. N. Engl. J. Med. 360 1696-1698.
[19] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning : Data Mining , Inference , and Prediction . Springer, New York. · Zbl 0973.62007
[20] He, X., Ng, P. and Portnoy, S. (1998). Bivariate quantile smoothing splines. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 537-550. · Zbl 0909.62038
[21] Hindorff, L. A., Junkins, H. A., Hall, P. N., Mehta, J. P. and Manolio, T. A. (2011). A catalog of published genome-wide association studies. Available at .
[22] Hochbaum, D. S. and Queyranne, M. (2003). Minimizing a convex cost closure set. SIAM J. Discrete Math. 16 192-207 (electronic). · Zbl 1041.68070
[23] Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29 1-27. · Zbl 0123.36803
[24] Lee, C. I. C. (1983). The min-max algorithm and isotonic regression. Ann. Statist. 11 467-477. · Zbl 0521.62060
[25] Luss, R., Rosset, S. and Shahar, M. (2010). Decomposing isotonic regression for efficiently solving large problems. In Proceeedings of the Neural Information Processing Systems Conference (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 1513-1521.
[26] Mani, R., Onge, R. P. S., Hartman, J. L., Giaever, G. and Roth, F. P. (2007). Defining genetic interaction. Proc. Nat. Acad. Sci. U.S.A. 105 3461-3466.
[27] Maxwell, W. L. and Muckstadt, J. A. (1985). Establishing consistent and realistic reorder intervals in production-distribution systems. Oper. Res. 33 1316-1341. · Zbl 0579.90048
[28] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28 1083-1104. · Zbl 1105.62340
[29] Monteiro, R. D. C. and Adler, I. (1989). Interior path following primal-dual algorithms. II. Convex quadratic programming. Math. Program. 44 43-66. · Zbl 0676.90039
[30] Obozinski, G., Lanckriet, G., Grant, C., Jordan, M. I. and Noble, W. S. (2008). Consistent probabilistic outputs for protein function prediction. Genome Biology 9 247-254.
[31] Pardalos, P. M. and Xue, G. (1999). Algorithms for a class of isotonic regression problems. Algorithmica 23 211-222. · Zbl 0921.68045
[32] Roth, F. P., Lipshitz, H. D. and Andrews, B. J. (2009). Q&A: Epistasis. J. Biol. 8 35.
[33] Roundy, R. (1986). A 98%-effective lot-sizing rule for a multiproduct, multistage production/inventory system. Math. Oper. Res. 11 699-727. · Zbl 0613.90041
[34] Schell, M. J. and Singh, B. (1997). The reduced monotonic regression method. J. Amer. Statist. Assoc. 92 128-135. · Zbl 0890.62035
[35] Shao, H., Burrage, L. C., Sinasac, D. S., Hill, A. E., Ernest, S. R., O’Brien, W., Courtland, H.-W., Jepsen, K. J., Kirby, A., Kulbokas, E. J., Daly, M. J., Bromang, K. W., Lander, E. S. and Nadeau, J. H. (2008). Genetic architecture of complex traits: Large phenotypic effects and pervasive epistasis. Proc. Nat. Acad. Sci. U.S.A. 50 11910-19914.
[36] Sleator, D. D. and Tarjan, R. E. (1983). A data structure for dynamic trees. J. Comput. System Sci. 26 362-391. · Zbl 0509.68058
[37] Spouge, J., Wan, H. and Wilbur, W. J. (2003). Least squares isotonic regression in two dimensions. J. Optim. Theory Appl. 117 585-605. · Zbl 1043.90011
[38] Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135-1151. · Zbl 0476.62035
[39] Stout, Q. (2010). An approach to computing multidimensional isotonic regressions. Unpublished manuscript. Available at .
[40] WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447 661-678.
[41] Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93 120-131. · Zbl 0920.62056
[42] Zhang, Y., Zhang, J. and Liu, J. S. (2011). Block-based Bayesian epistasis association mapping with application to WTCCC Type 1 diabetes data. Ann. Appl. Statist. 5 2052-2077. · Zbl 1228.62152
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.