Ranking and selecting terms for text categorization via SVM discriminate boundary. (English) Zbl 1211.68494

Summary: The problem of natural language document categorization consists of classifying documents into predetermined categories based on their contents. Each distinct term, or word, in a document serves as a feature for representing that document. In general, the number of terms can be extremely large, and many redundant terms may be included, which can degrade classification performance. In this paper, a support vector machine (SVM) based feature ranking and selection method for text categorization is proposed. The contribution of each term to classification is calculated from the nonlinear discriminant boundary generated by the SVM. Experiments on several real-world data sets show that the proposed method extracts a smaller number of important terms and achieves higher classification performance than existing feature selection methods based on latent semantic indexing and \(\chi ^{2}\) statistics.
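The paper ranks terms by their contribution to the SVM's nonlinear decision boundary. As a rough illustration of the general idea (not the authors' method), the linear special case reduces to ranking terms by the magnitude of the trained weight vector: terms with large \(|w_j|\) move a document far from the boundary, while uninformative terms receive weights near zero. The sketch below, with an assumed toy vocabulary and bag-of-words data, trains a linear SVM by subgradient descent on the hinge loss and ranks terms by \(|w_j|\); all identifiers are illustrative.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr0=0.5):
    """Pegasos-style subgradient descent on the regularized hinge loss.

    X: list of bag-of-words count vectors, y: labels in {+1, -1}.
    Returns the weight vector w of a linear SVM (bias omitted for brevity).
    """
    d = len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            lr = lr0 / (1.0 + lam * lr0 * t)  # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # subgradient of (lam/2)*||w||^2 + max(0, 1 - margin)
            for j in range(d):
                g = lam * w[j] - (yi * xi[j] if margin < 1 else 0.0)
                w[j] -= lr * g
    return w

def rank_terms(vocab, w):
    # Larger |w_j| means term j contributes more to the decision boundary.
    return sorted(vocab, key=lambda term: -abs(w[vocab.index(term)]))

# Toy data (assumed, for illustration): two "sports" and two "politics"
# documents; "the" is a shared stop word carrying no class information.
vocab = ["ball", "game", "vote", "law", "the"]
X = [[2, 1, 0, 0, 3],
     [1, 2, 0, 0, 2],
     [0, 0, 2, 1, 3],
     [0, 0, 1, 2, 2]]
y = [1, 1, -1, -1]

w = train_linear_svm(X, y)
ranking = rank_terms(vocab, w)
# The shared stop word should receive the smallest |w| and rank last.
```

The paper's contribution is the extension of this kind of weight-based ranking to the nonlinear kernel case, where no explicit weight vector exists and the contribution of each term must be computed from the discriminant boundary itself.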

MSC:

68U15 Computing methodologies for text processing; mathematical typography
62P99 Applications of statistics
68T05 Learning and adaptive systems in artificial intelligence

Software:

LIBSVM; Bow

References:

[1] Rocchio, The SMART retrieval system: Experiments in automatic document processing pp 313– (1979)
[2] McCallum, AAAI-98 Workshop on Learning for Text Categorization (1998)
[3] Yang, An evaluation of statistical approaches to text categorization, Inf Retr 1 (1/2) pp 69– (1999)
[4] Salton, The SMART retrieval system (1971)
[5] Salton, Introduction to modern information retrieval (1983) · Zbl 0523.68084
[6] Deerwester, Indexing by latent semantic analysis, J Am Soc Inform Sci 41 (6) pp 391– (1990)
[7] Chen, A new differential LSI space-based probabilistic document classifier, Inf Process Lett 88 pp 203– (2003) · Zbl 1178.68213
[8] Fortuna, Improved support vector classification using PCA and ICA feature space modification, Pattern Recognit 37 pp 1117– (2004) · Zbl 1070.68538
[9] Shima, SVM-based feature selection of latent semantic features, Pattern Recognit Lett 25 pp 1051– (2004)
[10] Vapnik, Statistical learning theory (1998)
[11] Joachims, Learning to classify text using support vector machines (2002) · doi:10.1007/978-1-4615-0907-3
[12] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] Joachims, Advances in kernel methods pp 169– (1999)
[14] Platt, Advances in kernel methods pp 185– (1999)
[15] Zhang J, Liu Y. SVM decision boundary based discriminative subspace induction. Technical Report CMU-RI-TR-02-15, 2002, Carnegie Mellon University.
[16] Lee, Feature extraction based on decision boundaries, IEEE Trans Pattern Anal Mach Intell 15 (4) pp 388– (1993)
[17] Shawe-Taylor, Kernel methods for pattern analysis (2004) · doi:10.1017/CBO9780511809682
[18] Craven, Proceedings of AAAI-98, 15th Conf American Association for Artificial Intelligence pp 509– (1998)
[19] Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proc 7th Int Conf on Information and Knowledge Management; 1998. pp 148–155.
[20] Joachims T. Text categorization with support vector machines: Learning with many relevant features. In: Proc 10th European Conf on Machine Learning; 1998. pp 137–142.
[21] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Proc 14th Int Conf on Machine Learning; 1997. pp 412–420.
[22] McCallum AK. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. 1996. http://www.cs.cmu.edu/~mccallum/bow