In this paper we compare the usefulness of statistical dimensionality reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. We then investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The results show an advantage of Latent Semantic Analysis over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.
Published: 2015-02-04
Classification: Document clustering, latent semantic analysis, probabilistic latent semantic analysis, natural language processing, 68T50, 68T05, 68T35
@article{cai2794,
     author = {Marcin Kuta and Jacek Kitowski},
     title = {Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering},
     journal = {Computing and Informatics},
     volume = {33},
     number = {3},
     year = {2015},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2794}
}
Marcin Kuta (AGH University of Science and Technology, Department of Computer Science, Al. Mickiewicza 30, 30-059 Kraków); Jacek Kitowski (AGH University of Science and Technology, Department of Computer Science, Al. Mickiewicza 30, 30-059 Kraków). Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering. Computing and Informatics, Volume 33 (2015) no. 3. http://gdmltest.u-ga.fr/item/cai2794/