Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure. We evaluated it on the UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well on various text collections, but was also able to cluster the news articles related to specific events much better than other models.
Published: 2013-01-30
Classification: Document clustering, bursty model, web mining
@article{cai1330,
     author = {Apirak Hoonlor; Faculty of ICT, Mahidol University, Bangkok and Boles{\l}aw K. Szymanski; Rensselaer Polytechnic Institute, Troy, NY and Mohamed J. Zaki; Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY and Vineet Chaoji; Yahoo! Labs, Bangalore},
     title = {Document Clustering with Bursty Information},
     journal = {Computing and Informatics},
     volume = {31},
     number = {6},
     year = {2013},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai1330}
}
Apirak Hoonlor; Faculty of ICT, Mahidol University, Bangkok; Bolesław K. Szymanski; Rensselaer Polytechnic Institute, Troy, NY; Mohamed J. Zaki; Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY; Vineet Chaoji; Yahoo! Labs, Bangalore. Document Clustering with Bursty Information. Computing and Informatics, Volume 31 (2013) no. 6. http://gdmltest.u-ga.fr/item/cai1330/