×

A unified scheme for generalizing cardinality estimators to sum aggregation. (English) Zbl 1302.68306

Summary: Cardinality estimation algorithms receive a stream of elements that may appear in arbitrary order, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage at the price of inaccuracy in their output. This paper shows how to generalize every cardinality estimation algorithm that relies on extreme order statistics (min/max sketches) to a weighted version, where each item is associated with a weight and the goal is to estimate the total sum of weights. The proposed unified scheme uses the unweighted estimator as a black-box, and manipulates the input using properties of the beta distribution.

MSC:

68W01 General topics in the theory of algorithms
62-04 Software, source code, etc. for problems pertaining to statistics
62G05 Nonparametric estimation
62G30 Order statistics; empirical distribution functions
62G32 Statistics of extreme values; tail inference

Software:

HyperLogLog
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Chassaing, P.; Gérin, L., Efficient estimation of the cardinality of large data sets, (Proceedings of the 4th Colloquium on Mathematics and Computer Science. Proceedings of the 4th Colloquium on Mathematics and Computer Science, Discrete Math. Theor. Comput. Sci. Proc., vol. AG (2006)), 419-422 · Zbl 1191.68378
[2] Clifford, P.; Cosma, I. A., A statistical analysis of probabilistic counting algorithms, Scand. J. Stat. (2011)
[3] Cohen, E., Size-estimation framework with applications to transitive closure and reachability, J. Comput. Syst. Sci., 55, 3, 441-453 (1997) · Zbl 0897.68075
[4] Cohen, E.; Kaplan, H., Tighter estimation using bottom k sketches, Proc. VLDB Endow., 1, 1, 213-224 (2008)
[5] Considine, J.; Li, F.; Kollios, G.; Byers, J. W., Approximate aggregation techniques for sensor databases, (ICDE (2004)), 449-460
[6] Flajolet, P.; Fusy, É.; Gandouet, O.; Meunier, F., HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, (Analysis of Algorithms. Analysis of Algorithms, AofA 2007. Analysis of Algorithms. Analysis of Algorithms, AofA 2007, Discrete Math. Theor. Comput. Sci. (2007)) · Zbl 1192.68959
[7] Flajolet, P.; Martin, G. N., Probabilistic counting algorithms for data base applications, J. Comput. Syst. Sci., 31, 182-209 (Sep. 1985)
[8] Fusy, É.; Giroire, F., Estimating the number of active flows in a data stream over a sliding window, (Panario, D.; Sedgewick, R., ANALCO (2007), SIAM), 223-231 · Zbl 1430.68105
[9] Ganguly, S.; Garofalakis, M. N.; Rastogi, R.; Sabnani, K. K., Streaming algorithms for robust, real-time detection of DDoS attacks, (ICDCS (2007), IEEE Computer Society), 4
[10] Giroire, F., Directions to use probabilistic algorithms for cardinality for DNA analysis, Journ. Ouvert. Biol. Inform. Math. (2006)
[11] Giroire, F., Order statistics and estimating cardinalities of massive data sets, Discrete Appl. Math., 157, 406-427 (2009) · Zbl 1169.68054
[12] Heule, S.; Nunkesser, M.; Hall, A., HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm, (Proceedings of the EDBT 2013 Conference (2013))
[13] Indyk, P., Stable distributions, pseudorandom generators, embeddings, and data stream computation, J. ACM, 53, 307-323 (May 2006)
[14] Krishnamoorthy, K., Handbook of Statistical Distributions with Applications (2006), Chapman & Hall/CRC Press: Chapman & Hall/CRC Press Boca Raton, FL · Zbl 1111.62011
[15] Lumbroso, J., An optimal cardinality estimation algorithm based on order statistics and its full analysis, (Analysis of Algorithms. Analysis of Algorithms, AofA 2010. Analysis of Algorithms. Analysis of Algorithms, AofA 2010, Discrete Math. Theor. Comput. Sci. (2010)) · Zbl 1360.62241
[16] Metwally, A.; Agrawal, D.; Abbadi, A. E., Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic, (Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology. Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT’08 (2008)), 618-629
[17] Qian, C.; Ngan, H.; Liu, Y.; Ni, L. M., Cardinality estimation for large-scale RFID systems, IEEE Trans. Parallel Distrib. Syst., 22, 9, 1441-1454 (2011)
[18] Rubinstein, R. Y.; Kroese, D. P., Simulation and the Monte Carlo Method, Wiley Series in Probability and Statistics (2008) · Zbl 1147.68831
[19] Walck, C., Handbook on Statistical Distributions for Experimentalists (Dec. 1996)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.