Group variable selection methods and their applications in analysis of genomic data. (English) Zbl 1417.92109

Feng, Jianfeng (ed.) et al., Frontiers in computational and systems biology. Dedicated to the 70th birthday of Minping Qian. London: Springer. Comput. Biol. 15, 231-248 (2010).
From the text: Although large-scale genomic data have been routinely created in biomedical research, extracting useful information from the data remains a challenge. Available statistical and computational tools encounter major difficulties of high dimensionality and complicated dependence in the data. This chapter discusses variable selection approaches for high dimensions and, more importantly, new ideas of group variable selection. The group information naturally embedded in biological systems or pathways helps to enhance signals in analysis of genomic data.
Traditional forward selection is a heuristic approach that does not guarantee an optimal solution. LARS, a less greedy version of the traditional forward selection method, is however shown by B. Efron et al. [Ann. Stat. 32, No. 2, 407–499 (2004; Zbl 1091.62054)] to be closely related to Lasso, which possesses optimal properties under appropriate conditions [K. Knight and W. Fu, Ann. Stat. 28, No. 5, 1356–1378 (2000; Zbl 1105.62357); C.-H. Zhang and J. Huang, Ann. Stat. 36, No. 4, 1567–1594 (2008; Zbl 1142.62044); P. Zhao and B. Yu, J. Mach. Learn. Res. 7, 2541–2563 (2006; Zbl 1222.62008)]. Our proposed gLars and gRidge take advantage of the LARS procedure while aiming at group selection for dependent data. The methods do not require prior information on the underlying group structures but construct groups along the selection procedure. Our grouping criteria consider the joint information of \(x\) and \(y\) and therefore fit the context of variable selection better than standard clustering on \(x\) alone. On the other hand, any prior information on the model or groups can be easily incorporated into the algorithms of gLars and gRidge by manually selecting certain variables at specific steps. The current methods may be improved by exploring different thresholds \((t_1, t_2)\) in the grouping definition.
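For orientation, the greedy character of traditional forward selection can be illustrated with a minimal sketch. This is not the gLars or gRidge procedure of the chapter; it is plain forward selection for a linear model, written in pure Python, and the names `dot` and `forward_selection` are hypothetical helpers introduced only for this illustration. At each step it adds the single predictor that most reduces the residual sum of squares, which is exactly why the result is heuristic rather than guaranteed optimal.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def forward_selection(columns, y, n_steps):
    """Greedy forward selection for least squares (illustrative sketch).

    columns : list of predictor columns, each a list of floats
    y       : response, a list of floats
    n_steps : maximum number of predictors to select
    Returns (indices in order of selection, final residual sum of squares).
    """
    resid = list(y)
    # Working copies of the columns, kept orthogonal to everything selected so far.
    work = [list(c) for c in columns]
    selected = []
    for _ in range(n_steps):
        best_j, best_gain = None, 0.0
        for j, z in enumerate(work):
            if j in selected:
                continue
            zz = dot(z, z)
            if zz < 1e-12:  # column already explained by the selected ones
                continue
            gain = dot(resid, z) ** 2 / zz  # drop in RSS if column j is added
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:  # no remaining column reduces the RSS
            break
        z = work[best_j]
        zz = dot(z, z)
        coef = dot(resid, z) / zz
        resid = [r - coef * a for r, a in zip(resid, z)]
        # Orthogonalize the remaining candidates against the chosen direction.
        for j, w in enumerate(work):
            if j != best_j and j not in selected:
                proj = dot(w, z) / zz
                work[j] = [b - proj * a for b, a in zip(w, z)]
        selected.append(best_j)
    return selected, dot(resid, resid)
```

Because each step commits to one variable and never revisits the choice, strongly correlated predictors can mask one another, which is the situation that group selection is meant to address.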
\(\text{SCAD\_}\ell_2\) is a combination of the unbiased approach SCAD and ridge regression. It is not as computationally efficient as the forward procedure but possesses good properties in terms of coefficient estimation. One direction for our future work is to extend the proposed group selection methods to general regression models, where \(y\) may depend on \(x\) through any nonlinear function. The proposed methods are more appropriate than other variable selection algorithms for data with complicated dependence structures.
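To make the combination of SCAD and ridge concrete, the following minimal sketch evaluates a penalty of that form. It uses the standard SCAD penalty of Fan and Li with the conventional default \(a = 3.7\) and adds a squared \(\ell_2\) term. The function names, the parameters `lam1` and `lam2`, and the exact scaling of the ridge term are assumptions made for illustration; the text above does not specify them.

```python
def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty for a single coefficient (requires a > 2, lam >= 0)."""
    t = abs(theta)
    if t <= lam:
        # Lasso-like linear part near zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition that gradually relaxes the shrinkage.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant part: large coefficients receive no additional penalty.
    return (a + 1) * lam * lam / 2


def scad_l2_penalty(beta, lam1, lam2, a=3.7):
    """SCAD penalty plus a ridge term (ridge scaling is an assumed convention)."""
    scad_part = sum(scad_penalty(b, lam1, a) for b in beta)
    ridge_part = lam2 * sum(b * b for b in beta)
    return scad_part + ridge_part
```

The flat tail of the SCAD part is what makes the estimate nearly unbiased for large coefficients, while the ridge part is the usual device for stabilizing estimates among correlated predictors.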
For the entire collection see [Zbl 1194.92031].


92D10 Genetics and epigenetics
62P10 Applications of statistics to biology and medical sciences; meta analysis