Data De-Duplication with Adaptive Chunking and Accelerated Modification Identifying

Xingjun Zhang; Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049; Guofeng Zhu; Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049; Endong Wang; Inspur(Beijing) Electronic Information Industry Co. Ltd., 100085, Beijing; Scott Fowler; Department of Science and Technology, Linköping University, Campus Norrköping, SE-601 74; Xiaoshe Dong; Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049

Computing and Informatics, Volume 34 (2016) no. 4 / Harvested from Computing and Informatics

Abstract

A data de-duplication system must deliver not only a high de-duplication rate, that is, the aggregate reduction in storage requirements gained from de-duplication, but also a high de-duplication speed. To address the arbitrary parameter setting required by Content Defined Chunking (CDC), a self-adaptive data chunking algorithm is proposed. The algorithm improves the de-duplication rate by first de-duplicating samples of the classified files in a pre-processing step and then selecting appropriate algorithm parameters. In addition, FastCDC, a content-based fast data chunking algorithm, is adopted to address the low de-duplication speed of CDC. FastCDC introduces a de-duplication factor and an acceleration factor; by adjusting these two parameters, it can significantly boost de-duplication speed without substantially sacrificing the de-duplication rate. The experimental results show that the proposed method improves the de-duplication rate by about 5 %, while FastCDC increases de-duplication speed by 50 % to 200 % at the cost of less than 3 % loss in de-duplication rate.
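The general idea of content-defined chunking with sample-based parameter selection can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the rolling hash, the `GEAR` table, the function names `cdc_chunks`, `dedup_rate` and `pick_mask_bits`, and all default parameter values are assumptions chosen for illustration, and the paper's de-duplication factor and acceleration factor are not modelled here.

```python
import hashlib
import random

# Fixed pseudo-random table mapping each byte value to a 32-bit number
# (hypothetical; seeded so chunk boundaries are reproducible across runs).
_rng = random.Random(2016)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, mask_bits=6, min_size=32, max_size=512):
    """Yield chunks of `data` cut at content-defined boundaries."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: bytes older than 32 positions shift out of the state.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the top `mask_bits` bits are zero, or at the size cap.
        at_boundary = length >= min_size and (h >> (32 - mask_bits)) == 0
        if at_boundary or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing remainder, possibly shorter than min_size


def dedup_rate(data, **params):
    """Fraction of bytes saved by storing each distinct chunk only once."""
    if not data:
        return 0.0
    seen, unique_bytes = set(), 0
    for chunk in cdc_chunks(data, **params):
        digest = hashlib.sha1(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            unique_bytes += len(chunk)
    return 1.0 - unique_bytes / len(data)


def pick_mask_bits(sample, candidates=(4, 6, 8)):
    """Choose the candidate that de-duplicates a file sample best."""
    return max(candidates, key=lambda bits: dedup_rate(sample, mask_bits=bits))
```

In this sketch, a caller would run `pick_mask_bits` on a sample drawn from each class of files and then chunk the remaining files of that class with the selected value. A smaller `mask_bits` gives smaller average chunks, which tends to find more duplicates at the cost of more chunk metadata and hashing work.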

Published: 2016-11-02
Classification: Data de-duplication, self-adaptive, FastCDC

@article{cai1687,
     author = {Xingjun Zhang and Guofeng Zhu and Endong Wang and Scott Fowler and Xiaoshe Dong},
     title = {Data De-Duplication with Adaptive Chunking and Accelerated Modification Identifying},
     journal = {Computing and Informatics},
     volume = {34},
     number = {4},
     year = {2016},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai1687}
}

Xingjun Zhang; Guofeng Zhu; Endong Wang; Scott Fowler; Xiaoshe Dong. Data De-Duplication with Adaptive Chunking and Accelerated Modification Identifying. Computing and Informatics, Volume 34 (2016) no. 4. http://gdmltest.u-ga.fr/item/cai1687/