This paper presents a study of an n-gram-based document comparison method. The method is intended to build a large-scale plagiarism detection system. The work focuses not only on an efficiency of the text similarity extraction but also on the execution performance of the implemented algorithms. We took notice of detection performance, storage requirements and execution time of the proposed approach. The obtained results show the trade-offs between detection quality and computational requirements. The GPGPU and multi-CPU platforms were considered to implement the algorithms and to achieve good execution speed. The method consists of two main algorithms: a document's feature extraction and fast text comparison. The winnowing algorithm is used to generate a compressed representation of the analyzed documents. The authors designed and implemented a dedicated test framework for the algorithm. That allowed for the tuning, evaluation, and optimization of the parameters. Well-known metrics (e.g. precision, recall) were used to evaluate detection performance. The authors conducted the tests to determine the performance of the winnowing algorithm for obfuscated and unobfuscated texts for a different window and n-gram size. Also, a simplified version of the text comparison algorithm was proposed and evaluated to reduce the computational complexity of the text comparison process. The paper also presents GPGPU and multi-CPU implementations of the algorithms for different data structures. The implementation speed was tested for different algorithms' parameters and the size of data. The scalability of the algorithm on multi-CPU platforms was verified. The authors of the paper provide the repository of software tools and programs used to perform the conducted experiments.he appropriate fast document comparison system. Its performance is given in the paper.
Publié le : 2017-11-29
Classification:  Knowledge and Information Engineering,  Text similarity analysis, n-gram-based model, GPGPU implementation, multi-CPU implementation
@article{cai2017_4_887,
     author = {Maciej Wielgosz; AGH University of Science and Technology, Krakow and Pawe\l\ Szczepka; AGH University of Science and Technology, Krakow and Pawel Russek; AGH University of Science and Technology, Krakow and Ernest Jamro; AGH University of Science and Technology, Krakow and Kazimierz Wiatr; AGH University of Science and Technology, Krakow and Marcin Pietro\'n; ACC Cyfronet AGH, Krakow and Dominik \.Zurek; ACC Cyfronet AGH, Krakow},
     title = {Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison},
     journal = {Computing and Informatics},
     volume = {35},
     number = {4},
     year = {2017},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2017_4_887}
}
Maciej Wielgosz; AGH University of Science and Technology, Krakow; Paweł Szczepka; AGH University of Science and Technology, Krakow; Pawel Russek; AGH University of Science and Technology, Krakow; Ernest Jamro; AGH University of Science and Technology, Krakow; Kazimierz Wiatr; AGH University of Science and Technology, Krakow; Marcin Pietroń; ACC Cyfronet AGH, Krakow; Dominik Żurek; ACC Cyfronet AGH, Krakow. Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison. Computing and Informatics, Tome 35 (2017) no. 4, . http://gdmltest.u-ga.fr/item/cai2017_4_887/