Using WordNet to Complement Training Information in Text Categorization

cmp-lg 2008-02-03 v1 Computation and Language

Abstract

Automatic Text Categorization (TC) is a complex and useful task for many natural language applications. It is usually performed with a set of manually classified documents, called a training collection. We suggest using additional resources, such as lexical databases, to increase the amount of information available to TC systems and thus improve their performance. Our approach integrates WordNet information with two training approaches through the Vector Space Model. The training approaches we test are the Rocchio (relevance feedback) and the Widrow-Hoff (machine learning) algorithms. Our evaluation shows that integrating WordNet clearly outperforms the training approaches used alone, and that an integrated technique can effectively address the classification of low-frequency categories.
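
As a rough illustration of how a lexical resource could be combined with the two training algorithms in the Vector Space Model, the sketch below seeds a category weight vector from lexicon-derived terms and then refines it with training documents. The abstract does not specify the integration mechanism, so using the lexical vector as the initial weights is an assumption made here for illustration only. The function names, the parameters `alpha`, `beta`, `gamma`, and `eta`, the clipping of negative weights, and the toy vectors are likewise hypothetical.

```python
import numpy as np


def rocchio(w0, docs, labels, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio-style update of a category weight vector.

    w0 is the initial vector (assumed here to come from lexicon terms);
    docs are term-weight vectors; labels mark documents in the category.
    """
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    w = alpha * np.asarray(w0, dtype=float)
    pos, neg = docs[labels], docs[~labels]
    if len(pos):
        w = w + beta * pos.mean(axis=0)   # pull toward positive examples
    if len(neg):
        w = w - gamma * neg.mean(axis=0)  # push away from negative examples
    return np.maximum(w, 0.0)             # clip negative weights (a common variant)


def widrow_hoff(w0, docs, labels, eta=0.1):
    """Widrow-Hoff (LMS) update: one pass over the documents, in order."""
    w = np.asarray(w0, dtype=float).copy()
    for x, y in zip(np.asarray(docs, dtype=float), labels):
        w -= 2.0 * eta * (w @ x - float(y)) * x  # gradient step on squared error
    return w


# Toy data: 4-term vocabulary, one positive and one negative document.
lexical_seed = [1.0, 0.0, 0.0, 0.0]   # hypothetical lexicon-derived vector
train_docs = [[1, 1, 0, 0], [0, 0, 1, 1]]
train_labels = [1, 0]

w_rocchio = rocchio(lexical_seed, train_docs, train_labels)
w_wh = widrow_hoff(lexical_seed, train_docs, train_labels)
```

Under this assumption, a category with very few training documents still starts from a non-empty weight vector, which is one plausible reading of how such a technique might help with low-frequency categories.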

Cite

@article{arxiv.cmp-lg/9709007,
  title  = {Using WordNet to Complement Training Information in Text Categorization},
  author = {Manuel de Buenaga Rodriguez and Jose Maria Gomez Hidalgo and Belen Diaz Agudo},
  journal= {arXiv preprint arXiv:cmp-lg/9709007},
  year   = {2008}
}

Comments

16 pages, 1 figure, 3 tables, previously with RANLP LaTeX style