In this paper we propose a novel kernel for text categorization. The kernel is an inner product defined in the feature space generated by all word combinations of a specified length. A word combination is a collection of unique words co-occurring in the same sentence. A word combination of length $k$ is weighted by the $k$-th root of the product of the inverse document frequencies (IDF) of its words. Because word order is discarded, word combination features are more compatible with the flexibility of natural language, and the feature dimensionality of documents can be reduced significantly, which makes the feature representations less sparse. Because the words are restricted to the same sentence and multi-word combinations are considered, word combination features can capture similarity at a more specific level than single words. We propose a computationally simple and efficient algorithm to calculate this kernel. We conducted a series of experiments on the Reuters-21578 and 20 Newsgroups datasets, in which this kernel achieves better performance than the word kernel and the word-sequence kernel. We also evaluated the computational efficiency of the kernel and examined the impact of the word combination length on performance.
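The feature space described in the abstract can be illustrated with a small brute-force sketch. This is not the efficient algorithm proposed in the paper, which the abstract does not spell out; it simply enumerates every combination explicitly. Several details are assumptions made for illustration: the function names, the IDF formula log(N / df), and the choice to accumulate a combination's weight once per sentence in which it occurs.

```python
import math
from collections import Counter
from itertools import combinations


def idf_table(docs):
    """Compute IDF per word. Each doc is a list of sentences (lists of tokens).

    Assumption: idf(w) = log(N / df(w)); the paper may use another variant.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update({w for sent in doc for w in sent})
    return {w: math.log(n_docs / count) for w, count in df.items()}


def combination_features(doc, k, idf):
    """Map each length-k word combination to its accumulated weight.

    A combination is a set of k unique words from one sentence; sorting the
    words makes the key independent of word order.
    """
    feats = Counter()
    for sent in doc:
        for combo in combinations(sorted(set(sent)), k):
            # Weight: k-th root of the product of the words' IDF values.
            weight = math.prod(idf.get(w, 0.0) for w in combo) ** (1.0 / k)
            # Assumption: one contribution per sentence containing the combo.
            feats[combo] += weight
    return feats


def word_combination_kernel(doc_a, doc_b, k, idf):
    """Inner product of the two documents' combination feature vectors."""
    fa = combination_features(doc_a, k, idf)
    fb = combination_features(doc_b, k, idf)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(v * fb[c] for c, v in fa.items() if c in fb)


# Toy corpus (hypothetical data): three one-sentence documents.
d1 = [["oil", "prices", "rise"]]
d2 = [["oil", "prices", "fall"]]
d3 = [["stocks", "rise"]]
idf = idf_table([d1, d2, d3])
k12 = word_combination_kernel(d1, d2, 2, idf)  # share only {oil, prices}
k13 = word_combination_kernel(d1, d3, 2, idf)  # share no pair of words
```

With k = 2, `d1` and `d3` share the single word "rise" but no word pair, so their kernel value is zero, whereas a single-word kernel would report some similarity. This is the sense in which combinations capture similarity at a more specific level than single words.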
Published: 2013-11-15
Classification: Machine learning, kernel methods, support vector machines, text classification, word-combination kernel, 62H30, 46E22, 68T05, 68T50
@article{cai1976,
     author = {Zhang, Lujiang and Hu, Xiaohui},
     title = {Word Combination Kernel for Text Classification with Support Vector Machines},
     journal = {Computing and Informatics},
     volume = {31},
     number = {6},
     year = {2013},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai1976}
}
Lujiang Zhang (School of Automation Science and Electrical Engineering, Beijing University of Aeronautics and Astronautics); Xiaohui Hu (School of Automation Science and Electrical Engineering, Beijing University of Aeronautics and Astronautics). Word Combination Kernel for Text Classification with Support Vector Machines. Computing and Informatics, Volume 31 (2013) no. 6. http://gdmltest.u-ga.fr/item/cai1976/