## Tournament screening cum EBIC for feature selection with high-dimensional feature spaces.
*(English)*
Zbl 1176.62014

Summary: Feature selection problems characterized by a relatively small sample size and an extremely high-dimensional feature space are common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computations become intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle this problem.
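Difficulty (i) can be illustrated with a small simulation. The following is only a sketch under assumed settings (standard normal data, a sample size of 20, and the hypothetical helper names `pearson` and `max_spurious_correlation`), not an experiment from the paper: it draws a response and many features that are all mutually independent and reports the largest absolute sample correlation found.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, seed=0):
    """Largest |sample correlation| between a response and p features,
    where the response and all features are independent N(0, 1) draws."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    # Illustrative settings only: n is held fixed while p grows.
    for p in (10, 100, 1000):
        print(p, round(max_spurious_correlation(n=20, p=p), 3))
```

With the sample size held fixed, the reported maximum cannot decrease as more independent features are added (the same seed reuses the earlier draws), which is the sense in which purely spurious correlations become large when the feature space is much larger than the sample.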

Along this line, we develop a tournament screening cum extended Bayes information criterion (EBIC) approach for feature selection with a high-dimensional feature space. The tournament screening procedure mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a property that any valid screening procedure must satisfy. Numerical studies demonstrate that the tournament screening cum EBIC approach enjoys desirable properties, such as a higher positive selection rate and a lower false discovery rate than other approaches.
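The summary does not spell out the procedure, so the following is only a rough sketch of how a tournament-style screening loop and an EBIC score might look, not the authors' algorithm. The group size, the number of winners kept per group, the stopping target, and the `select` callback are all hypothetical. The `ebic` function assumes a Gaussian linear model, where $-2$ times the maximized log-likelihood equals $n \log(\mathrm{RSS}/n)$ up to a constant, and adds the usual BIC penalty plus the extra term $2\gamma \log \binom{p}{k}$.

```python
import math
import random


def ebic(rss, n, k, p, gamma=1.0):
    """EBIC for a Gaussian linear model using k of p features (assumed setting).

    n * log(rss / n) stands in for -2 * max log-likelihood (up to a constant),
    k * log(n) is the BIC penalty, and 2 * gamma * log C(p, k) is the extra
    penalty for the size of the model space.
    """
    log_binom = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
    return n * math.log(rss / n) + k * math.log(n) + 2.0 * gamma * log_binom


def tournament_screen(features, select, group_size, keep, target, seed=0):
    """Split survivors into random groups and keep `keep` winners per group,
    round after round, until at most `target` features remain.

    `select(group, k)` is a hypothetical callback returning k winners of a group;
    how winners are chosen within a group is left to the caller.
    """
    if not 0 < keep < group_size or target < 1:
        raise ValueError("need 0 < keep < group_size and target >= 1")
    rng = random.Random(seed)
    survivors = list(features)
    while len(survivors) > target:
        if len(survivors) <= group_size:
            # Final round: a single group, cut directly down to the target size.
            return list(select(survivors, target))
        rng.shuffle(survivors)
        groups = [survivors[i:i + group_size]
                  for i in range(0, len(survivors), group_size)]
        survivors = [f for g in groups for f in select(g, min(keep, len(g)))]
    return survivors


if __name__ == "__main__":
    # Toy scores (made up): features 3, 17 and 42 are strong, the rest are weak.
    scores = {j: (10.0 if j in (3, 17, 42) else 0.001 * j) for j in range(200)}
    top = lambda group, k: sorted(group, key=scores.get, reverse=True)[:k]
    print(sorted(tournament_screen(range(200), top, group_size=20, keep=5, target=10)))
```

In a screening-then-selection workflow of this kind, the features surviving the tournament would be passed to a low-dimensional method, and candidate models would be compared by their `ebic` values, with smaller values preferred. Setting `gamma=0` reduces the score to ordinary BIC.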

### MSC:

62F07 | Statistical ranking and selection procedures |

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

62J05 | Linear regression; mixed models |

65C60 | Computational problems in statistics (MSC2010) |

*Z. Chen* and *J. Chen*, Sci. China, Ser. A 52, No. 6, 1327–1341 (2009; Zbl 1176.62014)
