Rare feature selection in high dimensions. (English) Zbl 1464.62334

Summary: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
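The aggregation idea in the summary can be illustrated with a minimal sketch. It assumes a hand-written toy tree over four rare words and fixes the aggregation nodes by hand. It does not reproduce the paper's estimator or the API of the rare package. All names (`leaves_under`, `aggregate_counts`, the tree, the counts) are hypothetical.

```python
# Toy illustration only: the tree, words, and counts are made up for this sketch.

def leaves_under(tree, node):
    """Return the leaf features below `node` in a dict-of-children tree."""
    children = tree.get(node, [])
    if not children:
        return [node]  # a node with no children is itself a leaf feature
    leaves = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves

def aggregate_counts(rows, tree, nodes):
    """For each row of sparse counts, sum the leaf counts under each chosen node."""
    groups = {node: leaves_under(tree, node) for node in nodes}
    return [
        {node: sum(row.get(leaf, 0) for leaf in leaves)
         for node, leaves in groups.items()}
        for row in rows
    ]

# Side information: a tree encoding which rare words are similar (assumed).
tree = {
    "root": ["positive", "negative"],
    "positive": ["superb", "stellar"],
    "negative": ["dingy", "grimy"],
}

# Sparse design matrix: each row stores only its nonzero word counts.
rows = [
    {"superb": 1},
    {"stellar": 2},
    {"dingy": 1, "grimy": 1},
    {},
]

# Two denser aggregated features replace four highly sparse ones.
dense = aggregate_counts(rows, tree, ["positive", "negative"])
```

Each original column is nonzero in a single row, whereas the aggregated "positive" column is nonzero in two rows. This is the sense in which aggregation along the tree produces denser features.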

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J07 Ridge regression; shrinkage estimators (Lasso)
65F50 Computational methods for sparse matrices
