Non parametric statistical models for on-line text classification. (English) Zbl 1256.62025

Summary: Social media, such as blogs and on-line forums, contain a huge amount of information that is typically unorganized and fragmented. An increasingly important problem is to classify on-line texts in order to detect possible anomalies. For example, on-line texts expressing consumer opinions can be valuable and profitable for companies, but they can also cause serious damage if they are negative or fake. We present a novel statistical methodology, rooted in classical text classification, to address such issues. Several classifiers have been proposed in the literature, among them support vector machines and naive Bayes classifiers. These approaches are not effective when the texts to be classified belong to an unknown author. To this end, we propose a new method based on the combination of classification trees with nonparametric approaches, such as the Kruskal-Wallis test and the test of E. Brunner, H. Dette and A. Munk [J. Am. Stat. Assoc. 92, No. 440, 1494–1502 (1997; Zbl 0921.62096)]. The main application of the proposal is the ability to classify an author either as a new one, who is potentially trustworthy, or as an old one, who is potentially fake.
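
As an illustration of the Kruskal-Wallis ingredient mentioned in the summary, the following is a minimal sketch that computes the tie-corrected H statistic for one text feature observed across several authors and uses it to rank features. The summary does not say how the paper combines the test with classification trees, so this is not the authors' procedure. The feature names, the toy frequencies and the helper names `average_ranks` and `kruskal_wallis_h` are all hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1.0 - ties / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


# Hypothetical data: relative frequency of a word in three texts by each of
# three authors (one inner list per author).
features = {
    "however": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
    "the": [[0.1, 0.5, 0.9], [0.2, 0.6, 0.7], [0.3, 0.4, 0.8]],
}

# Features whose distribution differs most across authors come first.
ranked = sorted(features, key=lambda f: kruskal_wallis_h(features[f]), reverse=True)
```

A larger H suggests that the feature's distribution differs between authors, so such features are natural candidates for tree splits; a feature with identical rank sums across authors gets H = 0.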

MSC:

62G10 Nonparametric hypothesis testing
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P99 Applications of statistics

Citations:

Zbl 0921.62096

References:

[1] Andrews FC (1954) Asymptotic behavior of some rank tests for analysis of variance. Ann Math Stat 25(4):724–736 · Zbl 0056.37302 · doi:10.1214/aoms/1177728658
[2] Baker LD, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of SIGIR-98, 21st ACM international conference on research and development in information retrieval (Melbourne), pp 96–103
[3] Benzécri J (1973) L’analyse des données. Dunod, Paris
[4] Boullé M (2009) Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach. Adv Data Anal Classif 3(1):39–61 · Zbl 1231.62030 · doi:10.1007/s11634-009-0038-7
[5] Breiman L, Friedman JH, Olshen R, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont · Zbl 0541.62042
[6] Brunner E, Dette H, Munk A (1997) Box-type approximations in nonparametric factorial designs. J Am Statist Assoc 92:1494–1502 · Zbl 0921.62096 · doi:10.1080/01621459.1997.10473671
[7] Cerchiello P (2011) Statistical models to measure corporate reputation. J Appl Quant Method 6(4):58–71
[8] Conover WJ (1971) Practical nonparametric statistics. Wiley, New York
[9] Dagan I, Karov Y, Roth D (1997) Mistake driven learning in text categorization. In: Proceedings of EMNLP-97, second conference on empirical methods in natural language processing, Providence, pp 55–63
[10] Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1306 · Zbl 1102.68553
[11] Frame S, Jammalamadaka S (2007) Generalized mixture models, semi-supervised learning, and unknown class inference. Adv Data Anal Classif 1(1):23–38 · Zbl 1133.62301 · doi:10.1007/s11634-006-0001-9
[12] Greenacre M (2007) Correspondence analysis in practice, 2nd edn. Chapman and Hall/CRC, London · Zbl 1198.62061
[13] Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(3):1157–1182 · Zbl 1102.68556
[14] Jindal N, Liu B (2008) Opinion spam and analysis. In: Proceedings WSDM-08, USA. doi: 10.1145/1341531.1341560
[15] Jindal N, Liu B, Lim EP (2010) Finding unusual review patterns using unexpected rules. In: Proceedings ACM-10, Canada. doi: 10.1145/1871437.1871669
[16] Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, Germany, pp 137–142
[17] Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions 2, 2nd edn. Wiley, New York · Zbl 0821.62001
[18] Kass GV (1980) An exploratory technique for investigating large quantities of categorical data. Appl Stat 29(2):119–127 · doi:10.2307/2986296
[19] Kim YH, Hahn SY, Zhang BT (2000) Text filtering by boosting naive Bayes classifiers. In: Proceedings of SIGIR-00, Greece. doi: 10.1145/345508.345572
[20] Le Thi H, Le H, Nguyen V, Pham Dinh T (2008) A DC programming approach for feature selection in support vector machines learning. Adv Data Anal Classif 2(3):259–278 · Zbl 1284.90057 · doi:10.1007/s11634-008-0030-7
[21] Najork M (2009) Web spam detection. In: Encyclopedia of database systems. Springer, Berlin
[22] Rust SW, Fligner MA (1984) A modification of the Kruskal–Wallis statistic for the generalized Behrens–Fisher problem. Commun Stat Theor Meth 13(16):2013–2027 · Zbl 0552.62026
[23] Siegel S, Castellan NJ Jr (1988) Nonparametric statistics for the behavioral sciences, 2nd edn. McGraw-Hill, London
[24] Stoppiglia H, Dreyfus G, Dubois R, Oussar Y (2003) Ranking a random feature for variable and feature selection. J Mach Learn Res 3:1399–1414 · Zbl 1102.68598
[25] Wilcox RR (2005) Introduction to robust estimation and hypothesis testing, 2nd edn. Elsevier Academic Press, Burlington · Zbl 1113.62036