Summarizing multiple Thai documents is challenging because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and then presents our three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted using fifty sets of Thai news articles together with their manually constructed reference summaries. On three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) the highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
Published: 2016-05-31
Classification: Knowledge and Information Engineering, Thai text summarization, multi-document summarization, iterative weighting, 68T50
@article{cai2209,
     author = {Ketui, Nongnuch and Theeramunkong, Thanaruk},
     title = {Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection},
     journal = {Computing and Informatics},
     volume = {34},
     number = {4},
     year = {2016},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2209}
}
Nongnuch Ketui and Thanaruk Theeramunkong (both at School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University). Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection. Computing and Informatics, Volume 34 (2016) no. 4. http://gdmltest.u-ga.fr/item/cai2209/