## Adaptive threshold-based classification of sparse high-dimensional data
*(English)*
Zbl 1493.62395

Summary: We revisit the problem of designing an efficient binary classifier in a challenging high-dimensional framework. The model under study assumes a local dependence structure among the feature variables, represented by a block-diagonal covariance matrix with a growing number of blocks of arbitrary but fixed size. The blocks correspond to non-overlapping, independent groups of strongly correlated features. To assess the relevance of a particular block in predicting the response, we introduce a measure of “signal strength” for each feature block. This measure is then used to specify the sparse model of interest. We further propose a threshold-based feature selector that operates as a screen-and-clean scheme integrated into a linear classifier: the data are subjected to screening and hard-threshold cleaning to filter out the blocks that contain no signal. Asymptotic properties of the proposed classifiers are studied in the regime where the sample size \(n\) depends on the number of feature blocks \(b\) and tends to infinity with \(b\), but at a slower rate than \(b\). The new classifiers, which are fully adaptive to the unknown parameters of the model, are shown to perform asymptotically optimally in a large part of the classification region. A numerical study confirms the good analytical properties of the new classifiers, which compare favorably with the existing threshold-based procedure used in a similar context.
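
The screen-and-clean idea described in the summary can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' procedure: the per-block statistic (a two-sample Hotelling-type quadratic form), the fixed threshold `4 * log(b)`, the equicorrelated simulation model, and all function names (`solve`, `fit`, `predict`, `sample`) are hypothetical. In particular, the paper's classifiers are adaptive to unknown model parameters, whereas this sketch uses a hand-picked threshold.

```python
import math
import random


def solve(a, rhs):
    """Solve a @ x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for c in reversed(range(k)):
        x[c] = (m[c][k] - sum(m[c][j] * x[j] for j in range(c + 1, k))) / m[c][c]
    return x


def block_mean(rows, cols):
    return [sum(r[c] for r in rows) / len(rows) for c in cols]


def pooled_cov(x0, x1, m0, m1, cols):
    """Pooled within-class sample covariance restricted to one block."""
    k = len(cols)
    s = [[0.0] * k for _ in range(k)]
    for rows, mu in ((x0, m0), (x1, m1)):
        for r in rows:
            d = [r[c] - mu[i] for i, c in enumerate(cols)]
            for i in range(k):
                for j in range(k):
                    s[i][j] += d[i] * d[j]
    dof = len(x0) + len(x1) - 2
    return [[v / dof for v in row] for row in s]


def fit(x0, x1, block_size, threshold):
    """Screen every block; keep those whose statistic exceeds the threshold."""
    n0, n1 = len(x0), len(x1)
    scale = n0 * n1 / (n0 + n1)
    kept = []
    for start in range(0, len(x0[0]), block_size):
        cols = list(range(start, start + block_size))
        m0, m1 = block_mean(x0, cols), block_mean(x1, cols)
        diff = [a - b for a, b in zip(m1, m0)]
        w = solve(pooled_cov(x0, x1, m0, m1, cols), diff)  # S^{-1} (m1 - m0)
        stat = scale * sum(wi * di for wi, di in zip(w, diff))
        if stat > threshold:  # hard thresholding (threshold is an assumption)
            mid = [(a + b) / 2 for a, b in zip(m0, m1)]
            kept.append((cols, w, mid))
    return kept


def predict(kept, x):
    """Linear rule summed over the retained blocks only; returns 0 or 1."""
    score = sum(
        wi * (x[c] - mi)
        for cols, w, mid in kept
        for wi, c, mi in zip(w, cols, mid)
    )
    return int(score > 0)


def sample(n, n_blocks, block_size, shift, signal_blocks, rho, rng):
    """Rows with independent equicorrelated blocks; signal blocks get a mean shift."""
    rows = []
    for _ in range(n):
        row = []
        for j in range(n_blocks):
            common = rng.gauss(0, 1)
            mu = shift if j in signal_blocks else 0.0
            row += [mu + math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0, 1)
                    for _ in range(block_size)]
        rows.append(row)
    return rows


# Toy setting: 200 blocks of size 2, only 3 carry signal, 40 samples per class.
rng = random.Random(0)
n_blocks, block_size, signal = 200, 2, {0, 1, 2}
train0 = sample(40, n_blocks, block_size, 0.0, signal, 0.6, rng)
train1 = sample(40, n_blocks, block_size, 2.0, signal, 0.6, rng)
kept = fit(train0, train1, block_size, threshold=4 * math.log(n_blocks))
selected = {cols[0] // block_size for cols, _, _ in kept}
test0 = sample(200, n_blocks, block_size, 0.0, signal, 0.6, rng)
test1 = sample(200, n_blocks, block_size, 2.0, signal, 0.6, rng)
correct = sum(predict(kept, x) == 0 for x in test0) + sum(predict(kept, x) == 1 for x in test1)
accuracy = correct / 400
print(sorted(selected), accuracy)
```

Because blocks are assumed independent, the linear score decomposes into a sum of per-block terms, so discarding a block simply removes its term. The sketch leans on that: screening and classification share the same per-block weights `w`.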

### MSC:

| Code | Description |
|---|---|
| 62H30 | Classification and discrimination; cluster analysis (statistical aspects) |
| 62H12 | Estimation in multivariate analysis |
| 62E20 | Asymptotic distribution theory in statistics |

### Keywords:

high-dimensional data; sparse vectors; adaptive threshold-based classification; asymptotically optimal classifier

*T. Pavlenko* et al., Electron. J. Stat. 16, No. 1, 1952–1996 (2022; Zbl 1493.62395)
