Model complexity selection in high-dimensional time-to-event data analysis.

*(English)* Zbl 1441.62017
Freiburg im Breisgau: Univ. Freiburg, Fakultät für Mathematik und Physik (Diss.). ix, 85 p. (2011).

Summary: Nowadays, huge amounts of molecular data, such as gene expression data, can be collected simultaneously for each patient. In recent decades, several statistical methods have been introduced to handle such so-called high-dimensional data. In this thesis, prediction models are considered with time to event as a possibly censored endpoint, e.g. in survival data. The focus is on model complexity selection, i.e. the determination of the complexity parameter(s) of the models, and related issues.

The thesis consists of three main parts. The first part addresses the question of whether regression techniques that were introduced for, or are frequently used in, high-dimensional data settings also work properly in low-dimensional settings. To this end, the stability of model selection, the number of covariates in the selected models, the bias of parameter estimates and the prediction performance of some shrinkage and boosting methods are investigated using a well-known breast cancer data set and variations thereof. In general, all methods are found to provide reasonable results, albeit with slightly different properties.

In the second part, the integrated prediction error curve (IPEC), which is a summary measure of the estimated prediction error over time, is introduced as a model selection criterion. A recent boosting approach, CoxBoost, is used with simulated and real data to compare the IPEC to the standard criterion, the partial log-likelihood (PLL). Similar results in terms of prediction performance are obtained, indicating that the IPEC is a reasonable criterion. In addition, different resampling schemes for estimating the PLL and the IPEC are considered. The results differ little, and the more intensive resampling approaches do not seem to pay off. The IPEC criterion has the advantage that it is also applicable in semi- or non-parametric settings. Thus, random forests, which are a tree-based prediction approach, are used to examine the possible benefit of this model selection strategy in comparison to rules of thumb, using simulated and real data sets. Although the obtained benefit is not very strong in these examples, the IPEC is preferable as a general criterion.
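
As a rough illustration of how an integrated prediction error curve can serve as a selection criterion, the following Python sketch integrates an already estimated error curve over a time grid with the trapezoidal rule and picks the complexity value with the smallest integral. The function names, the time grid and the error values are hypothetical and not taken from the thesis; how the curve itself is estimated, including any handling of censoring and resampling, is outside this sketch.

```python
def integrated_prediction_error(times, errors):
    """Trapezoidal integral of a prediction error curve over a time grid.

    `times` must be strictly increasing; `errors` holds the estimated
    prediction error at each time point (assumed to be given).
    """
    if len(times) != len(errors):
        raise ValueError("times and errors must have the same length")
    if len(times) < 2:
        raise ValueError("at least two time points are needed")
    total = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        total += 0.5 * width * (errors[i] + errors[i - 1])
    return total


def select_complexity(times, curves):
    """Return the complexity value whose error curve has the smallest integral.

    `curves` maps a candidate complexity value (e.g. a number of boosting
    steps) to its estimated prediction error curve on `times`.
    """
    return min(curves, key=lambda c: integrated_prediction_error(times, curves[c]))


# Made-up error curves for three hypothetical complexity values.
times = [0, 1, 2, 3]
curves = {
    50: [0.0, 0.10, 0.20, 0.25],
    100: [0.0, 0.08, 0.15, 0.22],
    200: [0.0, 0.09, 0.18, 0.26],
}
best = select_complexity(times, curves)
```

Because the criterion only needs predicted error at each time point, the same selection step applies regardless of whether the underlying model has a likelihood, which is the property the summary highlights for semi- or non-parametric settings.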

The third part deals with the challenge of estimating the false discovery rate (FDR) in CoxBoost. The FDR allows one to quantify the uncertainty of a list of covariates, here the covariates in the selected model. A simulation study is carried out to illustrate the behavior of the proposed approach. Despite some difficulties, this multivariable approach should be preferred to univariate approaches.
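
To make concrete what the FDR measures, the following Python sketch computes the false discovery proportion of a selected covariate list in a simulation where the truly informative covariates are known, and averages it over repetitions. This is only the quantity being targeted, not the estimation approach proposed in the thesis; the function names and covariate labels are hypothetical.

```python
def false_discovery_proportion(selected, true_covariates):
    """Fraction of selected covariates that have no true effect.

    Defined as 0.0 when nothing is selected (the usual convention).
    """
    selected = set(selected)
    if not selected:
        return 0.0
    false_hits = selected - set(true_covariates)
    return len(false_hits) / len(selected)


def empirical_fdr(selected_lists, true_covariates):
    """Average false discovery proportion over simulation repetitions."""
    if not selected_lists:
        raise ValueError("at least one repetition is needed")
    proportions = [
        false_discovery_proportion(s, true_covariates) for s in selected_lists
    ]
    return sum(proportions) / len(proportions)


# Made-up example: g1 and g2 are the truly informative covariates.
truth = ["g1", "g2"]
runs = [["g1", "g2", "g3", "g4"], ["g1"], []]
fdr = empirical_fdr(runs, truth)
```

In real data the set of truly informative covariates is unknown, which is why the FDR has to be estimated rather than computed directly as above.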

##### MSC:

62-02 | Research exposition (monographs, survey articles) pertaining to statistics |

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

62H12 | Estimation in multivariate analysis |

62N02 | Estimation in survival analysis and censored data |

62-08 | Computational methods for problems pertaining to statistics |