HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Flajolet, Philippe ; Fusy, Éric ; Gandouet, Olivier ; Meunier, Frédéric
HAL, hal-00406166 / Harvested from HAL
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of \emphdistinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes''), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about $1.04/\sqrt{m}$. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Publié le : 2007-06-17
Classification:  cardinality estimation,  Probabilistic algorithm,  [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS],  [INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM],  [MATH.MATH-CO]Mathematics [math]/Combinatorics [math.CO],  [INFO.INFO-CG]Computer Science [cs]/Computational Geometry [cs.CG]
@article{hal-00406166,
     author = {Flajolet, Philippe and Fusy, \'Eric and Gandouet, Olivier and Meunier, Fr\'ed\'eric},
     title = {HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm},
     journal = {HAL},
     volume = {2007},
     number = {0},
     year = {2007},
     language = {en},
     url = {http://dml.mathdoc.fr/item/hal-00406166}
}
Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. HAL, Tome 2007 (2007) no. 0, . http://gdmltest.u-ga.fr/item/hal-00406166/