The main goal of this work was to experimentally verify methods for the challenging task of categorizing and clustering Polish text. Supervised learning was used for categorization and unsupervised learning for clustering. The methods were examined in depth on a custom-built corpus of Polish texts, which the authors assembled from Internet resources. The corpus data was acquired from a news portal, so the documents had already been sorted by type by journalists according to their specialization. The presented algorithms employ the Vector Space Model (VSM) and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. A series of experiments revealed certain properties of the algorithms and their accuracy. Accuracy was evaluated as the algorithms' ability to match the human arrangement of documents by topic. For both categorization and clustering, the authors used the F-measure to assess the quality of allocation.
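To make the terms in the abstract concrete, the following is a minimal sketch of TF-IDF weighting in a vector space model, cosine similarity between document vectors, and a per-category F-measure. It is not the authors' implementation: the function names, the plain `tf * log(N / df)` weighting variant, and the toy tokens are assumptions chosen for illustration, and the abstract does not say which TF-IDF variant or preprocessing was actually used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def f_measure(predicted, actual, label):
    """F1 score (harmonic mean of precision and recall) for a single category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus: two sports-like documents and one politics-like one.
toy_docs = [
    ["mecz", "gol", "liga"],
    ["mecz", "liga", "trener"],
    ["sejm", "ustawa", "posel"],
]
toy_vectors = tfidf_vectors(toy_docs)
```

In this sketch, documents that share weighted terms get a positive cosine similarity while documents with no terms in common get zero, which is the property both a categorizer and a clustering algorithm would rely on. The F-measure then compares the assigned labels against the human (journalist) labels, one category at a time.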
Published: 2017-05-12
Classification: Knowledge and Information Engineering, Polish text, categorization, clustering, VSM, TF-IDF
@article{cai2017_1_186,
     author = {Maciej Wielgosz and Rafa{\l} Fraczek and Pawe{\l} Russek and Marcin Pietro{\'n} and Agnieszka Dabrowska-Boruch and Ernest Jamro and Kazimierz Wiatr},
     note = {All authors: AGH University of Science and Technology, Cracow},
     title = {Experiment on Methods for Clustering and Categorization of Polish Text},
     journal = {Computing and Informatics},
     volume = {35},
     number = {4},
     year = {2017},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2017_1_186}
}
Maciej Wielgosz, Rafał Fraczek, Paweł Russek, Marcin Pietroń, Agnieszka Dabrowska-Boruch, Ernest Jamro, Kazimierz Wiatr (all AGH University of Science and Technology, Cracow). Experiment on Methods for Clustering and Categorization of Polish Text. Computing and Informatics, Vol. 35 (2017), no. 4. http://gdmltest.u-ga.fr/item/cai2017_1_186/