In computing, duplicate data detection refers to identifying redundant copies of the same data. Identifying duplicate items in streamed data and eliminating them before storage is a complex task. This paper proposes a novel data structure for duplicate detection, a variant of the stable Bloom filter named FingerPrint Stable Bloom Filter (FP-SBF). The proposed approach uses a counting Bloom filter with fingerprint bits, along with an optimization mechanism for duplicate detection. FP-SBF uses d-left hashing, which reduces computation time and decreases both false positives and false negatives. FP-SBF can process unbounded data in a single pass using k hash functions and distinguishes duplicate from distinct elements in O(k+1) time, independent of the size of the incoming data. The performance of FP-SBF is compared with various Bloom filters used for duplicate detection in stream data; theoretical analysis and experiments show that the proposed approach detects duplicates in streaming data efficiently with lower memory requirements.
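The general idea of fingerprint-based duplicate detection with d-left hashing can be illustrated with a small sketch. This is not the FP-SBF structure from the paper: the class name `DLeftFingerprintFilter`, its parameters (`d`, `buckets`, `cells`, `fp_bits`), the use of BLAKE2b as the hash, and the oldest-first eviction rule are all assumptions chosen for illustration, and the paper's counters and optimization mechanism are not modeled.

```python
import hashlib


class DLeftFingerprintFilter:
    """Toy d-left fingerprint table for stream duplicate detection.

    Hypothetical illustration only; it does not reproduce FP-SBF.
    """

    def __init__(self, d=4, buckets=1024, cells=4, fp_bits=16):
        self.d = d              # number of subtables (assumed value)
        self.buckets = buckets  # buckets per subtable (assumed value)
        self.cells = cells      # fingerprints stored per bucket (assumed value)
        self.fp_bits = fp_bits  # fingerprint width in bits (assumed value)
        # d subtables; each bucket is a short list of stored fingerprints.
        self.tables = [[[] for _ in range(buckets)] for _ in range(d)]

    def _hash(self, item, salt):
        # Salted BLAKE2b gives one hash value per salt.
        data = item if isinstance(item, bytes) else str(item).encode("utf-8")
        digest = hashlib.blake2b(
            data, digest_size=8, salt=salt.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")

    def seen_before(self, item):
        """Return True if item looks like a duplicate; otherwise record it."""
        fp = self._hash(item, 0) % (1 << self.fp_bits)
        # One candidate bucket in each of the d subtables.
        candidates = [
            self.tables[i][self._hash(item, i + 1) % self.buckets]
            for i in range(self.d)
        ]
        if any(fp in bucket for bucket in candidates):
            return True  # may be a false positive if fingerprints collide
        # d-left rule: least-loaded bucket, ties broken toward the left.
        target = min(candidates, key=len)
        if len(target) >= self.cells:
            target.pop(0)  # assumed eviction: drop the oldest fingerprint
        target.append(fp)
        return False


if __name__ == "__main__":
    f = DLeftFingerprintFilter()
    for element in ["a", "b", "a", "c", "b"]:
        print(element, f.seen_before(element))
```

Each lookup touches a fixed number of buckets, so the per-element cost does not grow with the length of the stream. Evicting old fingerprints keeps memory bounded at the price of possible false negatives, while fingerprint collisions within a bucket can cause false positives.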
Published: 2019-02-05
Classification: Knowledge and Information Engineering; other areas of Computing and Informatics. Keywords: duplicate detection, stable Bloom filter, d-left hashing, FingerPrint bits, streaming data
@article{cai2018_6_1313,
     author = {Amritpal Singh and Shalini Batra},
     title = {FingerPrint Based Duplicate Detection in Streamed Data},
     journal = {Computing and Informatics},
     volume = {37},
     number = {6},
     year = {2019},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai2018_6_1313}
}
Amritpal Singh and Shalini Batra (Department of Computer Science and Engineering, Thapar University, Patiala, Punjab). FingerPrint Based Duplicate Detection in Streamed Data. Computing and Informatics, Volume 37 (2019), no. 6. http://gdmltest.u-ga.fr/item/cai2018_6_1313/