zbMATH — the first resource for mathematics

Multinomial inverse regression for text analysis. (English) Zbl 06224965
Summary: Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. Such data are also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with very high-dimensional response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for maximization of the joint posterior over coefficients and their prior scale. This "gamma-lasso" scheme yields stable and effective estimation for general high-dimensional logistic regression, and we argue that it will be superior to current methods in many settings. Guidelines for prior specification are provided, algorithm convergence is detailed, and estimator properties are outlined from the perspective of the literature on nonconcave likelihood penalization. Related work on sentiment analysis from statistics, econometrics, and machine learning is surveyed and connected. Finally, the methods are applied in two detailed examples and we provide out-of-sample prediction studies to illustrate their effectiveness.
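
The summary gives no formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes a multinomial logit with a scalar sentiment variable y, intercepts alpha and loadings phi, so that a document is reduced to the score z = phi'x/m (x the phrase counts, m their total). It also assumes a Laplace prior with rate lambda on each coefficient and a gamma(shape s, rate r) prior on lambda, which profiles to the penalty s*log(1 + |phi|/r). All function and parameter names are hypothetical and are not taken from the article or from any package.

```python
import numpy as np


def multinomial_probs(alpha, phi, y):
    """Phrase probabilities q_j proportional to exp(alpha_j + y * phi_j) (assumed logit form)."""
    eta = alpha + y * phi
    eta = eta - eta.max()          # subtract the max for numerical stability
    w = np.exp(eta)
    return w / w.sum()


def sr_projection(X, phi):
    """Reduce each row of counts to one score z_i = phi' x_i / m_i."""
    X = np.asarray(X, dtype=float)
    m = X.sum(axis=1)
    m = np.where(m > 0, m, 1.0)    # empty documents get a score of 0
    return (X @ phi) / m


def gamma_lasso_penalty(phi, s, r):
    """Profiled log-penalty s * log(1 + |phi| / r) under the assumed priors."""
    return s * np.log1p(np.abs(phi) / r)


def profiled_rate(phi, s, r):
    """Laplace rate that maximizes the joint posterior for fixed phi: s / (r + |phi|)."""
    return s / (r + np.abs(phi))


# Toy usage: three phrases, two documents, made-up loadings.
phi = np.array([1.0, -1.0, 0.0])
counts = [[2, 0, 2], [0, 4, 0]]
scores = sr_projection(counts, phi)
```

In this sketch the penalty is concave in |phi|, so large coefficients are shrunk proportionally less than under a fixed-rate lasso. This is the sense in which the summary relates the estimator to nonconcave likelihood penalization.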

MSC:
62 Statistics
References:
[1] Agresti A., Categorical Data Analysis (2nd ed.) (2002) · Zbl 1018.62002 · doi:10.1002/0471249688
[2] Bishop Y., Discrete Multivariate Analysis (1975)
[3] Blei D. M., Proceedings of the Neural Information Processing Systems pp 1– (2007)
[4] Blei D. M., Journal of Machine Learning Research 3 pp 993– (2003)
[5] Bollen J., Journal of Computational Science 2 pp 1– (2011) · doi:10.1016/j.jocs.2010.12.007
[6] Bura E., Journal of the Royal Statistical Society, Series B 63 pp 393– (2001) · Zbl 0979.62041 · doi:10.1111/1467-9868.00292
[7] doi:10.1080/01621459.1992.10475231
[8] Carvalho C. M., Biometrika 97 pp 465– (2010) · Zbl 1406.62021 · doi:10.1093/biomet/asq017
[9] Chang J., lda: Collapsed Gibbs Sampling Methods for Topic Models, R package version 1.3.1 (2011)
[10] Cook R. D., Statistical Science 22 pp 1– (2007) · Zbl 1246.62148 · doi:10.1214/088342306000000682
[11] doi:10.1198/jcgs.2009.08005
[12] Zbl 1073.62547 · doi:10.1198/016214501753382273
[13] Fan J., The Annals of Statistics 32 pp 928– (2004) · Zbl 1092.62031 · doi:10.1214/009053604000000256
[14] doi:10.1080/00401706.1993.10485033
[15] Friedman J. H., Journal of Statistical Software 33 (1) pp 1– (2010) · doi:10.18637/jss.v033.i01
[16] Gail M. H., Biometrika 71 pp 341– (1984) · Zbl 0567.62031 · doi:10.1093/biomet/71.2.341
[17] Gelman A., arm: Data Analysis Using Regression and Multilevel/Hierarchical Models, R package version 1.5-03 (2012)
[18] doi:10.1198/004017007000000245
[19] Gentzkow M., Econometrica 78 pp 35– (2010) · Zbl 05683005 · doi:10.3982/ECTA7195
[20] Gramacy R. B., monomvn: Estimation for Multivariate Normal and Student-t Data With Monotone Missingness, R package version 1.8-10 (2012)
[21] Gramacy R. B., Bayesian Analysis 7 pp 1– (2012) · Zbl 1330.62301 · doi:10.1214/12-BA719
[22] Grimmer J., Political Analysis 18 pp 1– (2010) · doi:10.1093/pan/mpp034
[23] Hastie T., The Elements of Statistical Learning (2009) · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[24] Holmes C. C., Bayesian Analysis 1 pp 145– (2006) · Zbl 1331.62142 · doi:10.1214/06-BA105
[25] Jurafsky D., Speech and Language Processing (2nd ed.) (2009)
[26] Karatzoglou A., Journal of Statistical Software 11 (9) pp 1– (2004) · doi:10.18637/jss.v011.i09
[27] Krishnapuram B., IEEE Transactions on Pattern Analysis and Machine Intelligence 27 pp 957– (2005) · Zbl 05111576 · doi:10.1109/TPAMI.2005.127
[28] doi:10.1080/10618600.2000.10474858
[29] Laver M., American Political Science Review 97 pp 311– (2003) · doi:10.1017/S0003055403000698
[30] Lehmann E. L., Sankhyā: The Indian Journal of Statistics 10 pp 305– (1950)
[31] doi:10.1080/01621459.1991.10475035
[32] Li L., Biometrika 94 pp 615– (2007) · Zbl 1134.62045 · doi:10.1093/biomet/asm043
[33] Loughran T., Journal of Finance 66 pp 35– (2011) · doi:10.1111/j.1540-6261.2010.01625.x
[34] Luenberger D. G., Linear and Nonlinear Programming (3rd ed.) (2008)
[35] Madigan D., AIP Conference Proceedings pp 509– (2005) · doi:10.1063/1.2149832
[36] Mauá D. D., ENIA ’09: VIII Encontro Nacional de Inteligência Artificial pp 1– (2009)
[37] Mazumder R., Journal of the American Statistical Association 106 pp 1125– (2011) · Zbl 1229.62091 · doi:10.1198/jasa.2011.tm09738
[38] Pang B., Foundations and Trends in Information Retrieval 1 pp 1– (2008) · Zbl 05519304 · doi:10.1561/1500000011
[39] Zbl 1330.62292 · doi:10.1198/016214508000000337
[40] Poon H., Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) pp 1– (2009)
[41] Porter M. F., Program 14 pp 130– (1980) · doi:10.1108/eb046814
[42] Quinn K., American Journal of Political Science 54 pp 209– (2010) · doi:10.1111/j.1540-5907.2009.00427.x
[43] Rossi P. E., Bayesian Statistics and Marketing (2005) · Zbl 1094.62037 · doi:10.1002/0470863692
[44] Schervish M. J., Theory of Statistics (1995) · Zbl 0834.62002 · doi:10.1007/978-1-4612-4250-5
[45] Srivastava A. N., Text Mining: Classification, Clustering, and Applications (2009) · Zbl 1177.68175 · doi:10.1201/9781420059458
[46] Taddy M., Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012) pp 1184– (2012)
[47] Talley E. L., Journal of Institutional and Theoretical Economics 168 pp 181– (2011) · doi:10.1628/093245612799440177
[48] Tetlock P., Journal of Finance 62 pp 1139– (2007) · doi:10.1111/j.1540-6261.2007.01232.x
[49] Thomas M., Proceedings of Empirical Methods in Natural Language Processing pp 327– (2006)
[50] Tibshirani R., Journal of the Royal Statistical Society, Series B 58 pp 267– (1996)
[51] West M., Bayesian Statistics (Vol. 7) pp 733– (2003)
[52] Wold H., Perspectives in Probability and Statistics: Papers in Honour of M.S. Bartlett pp 117– (1975)
[53] doi:10.1080/19331680802149608
[54] Zeger S. L., Biometrika 72 pp 31– (1985)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.