
Good-Turing frequency estimation in a finite population. (English) Zbl 1310.62115

Summary: Good-Turing frequency estimation [I. J. Good, Biometrika 40, 237–264 (1953; Zbl 0051.37103)] is a simple, effective method for predicting detection probabilities of objects of both observed and unobserved classes based on observed frequencies of classes in a sample. The method has been used widely in several disciplines, such as information retrieval, computational linguistics, text recognition, and ecological diversity estimation. Nevertheless, existing studies assume sampling with replacement or sampling from an infinite population, which might be inappropriate for many practical applications. In light of this limitation, this article presents a modification of the Good-Turing estimation method to account for finite population sampling. We provide three practical extensions of the modified method, and we examine performance of the modified method and its extensions in simulation experiments.
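
For orientation, the classical estimator of Good (1953) that the summary refers to can be sketched as follows. This is a minimal illustration of the standard infinite-population (or with-replacement) form only; it is not the finite-population modification proposed in the article, whose formulas are not given in this record. The function name `good_turing` and the toy sample are hypothetical. With N_r denoting the number of classes observed exactly r times in a sample of size n, the estimated total probability of unobserved classes is N_1 / n, and the adjusted count for a class seen r times is r* = (r + 1) N_{r+1} / N_r.

```python
from collections import Counter


def good_turing(sample):
    """Classical, unsmoothed Good-Turing estimates (illustrative sketch).

    sample: iterable of hashable class labels, one per sampled object.
    Returns (p_unseen, adjusted), where p_unseen estimates the total
    probability of all unobserved classes and adjusted maps each observed
    frequency r to the adjusted count r* = (r + 1) * N_{r+1} / N_r.
    """
    counts = Counter(sample)                  # class label -> frequency r
    n = sum(counts.values())                  # sample size
    if n == 0:
        raise ValueError("sample must contain at least one object")

    freq_of_freq = Counter(counts.values())   # r -> N_r

    # Estimated probability mass of classes never seen in the sample.
    p_unseen = freq_of_freq.get(1, 0) / n

    # Adjusted counts. When N_{r+1} is zero the raw estimate collapses to 0,
    # which is why practical implementations smooth the N_r values first.
    adjusted = {
        r: (r + 1) * freq_of_freq.get(r + 1, 0) / n_r
        for r, n_r in freq_of_freq.items()
    }
    return p_unseen, adjusted


if __name__ == "__main__":
    p0, r_star = good_turing(list("aaabbcdde"))
    print(p0, r_star)
```

In the toy sample, two of the nine objects belong to classes seen only once, so the estimated unseen mass is 2/9. The adjusted count for the largest observed frequency is zero, which illustrates why unsmoothed estimates are rarely used directly.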

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92D40 Ecology
62B10 Statistical aspects of information-theoretic topics
68T10 Pattern recognition, speech recognition

Citations:

Zbl 0051.37103

Software:

zipfR

References:

[1] Basharin, On a statistical estimate for the entropy of a sequence of independent random variables, Theory of Probability and Its Applications 4 pp 333– (1959) · doi:10.1137/1104033
[2] Chao, Estimating the number of classes via sample coverage, Journal of the American Statistical Association 87 pp 210– (1992) · Zbl 0850.62145 · doi:10.1080/01621459.1992.10475194
[3] Chao, Estimating population size for capture-recapture data when capture probabilities vary by time and individual animal, Biometrics 48 pp 201– (1992) · Zbl 0767.62091 · doi:10.2307/2532750
[4] Chao, Nonparametric lower bounds for species richness and shared species richness under sampling without replacement, Biometrics 68 pp 912– (2012) · Zbl 1271.62276 · doi:10.1111/j.1541-0420.2011.01739.x
[5] Chao, Nonparametric estimation of Shannon’s index of diversity when there are unseen species, Environmental and Ecological Statistics 10 pp 429– (2003) · doi:10.1023/A:1026096204727
[6] Chen, An empirical study of smoothing techniques for language modeling, Computer Speech and Language 13 pp 359– (1999) · Zbl 01938846 · doi:10.1006/csla.1999.0128
[7] Condit, Changes in a tropical forest with a shifting climate: results from a 50-ha permanent census plot in Panama, Journal of Tropical Ecology 12 pp 231– (1996) · doi:10.1017/S0266467400009433
[8] Colwell, Models and estimators linking individual-based and sample-based rarefaction, extrapolation, and comparison of assemblages, Journal of Plant Ecology 5 pp 3– (2012) · doi:10.1093/jpe/rtr044
[9] Cecconi, A new estimator for the number of species in a population, Sankhya A 74 pp 80– (2012) · Zbl 1281.62088 · doi:10.1007/s13171-012-0012-x
[10] Church, Word association norms, mutual information, and lexicography, Computational Linguistics 16 pp 22– (1990)
[11] Esty, A normal limit law for a nonparametric estimator of the coverage of a random sample, The Annals of Statistics 11 pp 905– (1983) · Zbl 0599.62053 · doi:10.1214/aos/1176346256
[12] Esty, Estimation of the number of classes in a population and the coverage of a population, Mathematical Scientist 10 pp 41– (1985) · Zbl 0585.62026
[13] Evert, S., Baroni, M. (2007) zipfR: Word frequency distributions in R, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pp 29–32, Prague, CZ (R package version 0.6-6 of 2012-04-03)
[14] Fisher, The relation between the number of species and the number of individuals in a random sample of an animal population, Journal of Animal Ecology 12 pp 42– (1943) · doi:10.2307/1411
[15] Gneiting, Strictly proper scoring rules, prediction, and estimation, Journal of the American Statistical Association 102 pp 359– (2007) · Zbl 1284.62093 · doi:10.1198/016214506000001437
[16] Good, The population of frequencies of species and the estimation of population parameters, Biometrika 40 pp 237– (1953) · Zbl 0051.37103 · doi:10.1093/biomet/40.3-4.237
[17] Good, Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma, Journal of Statistical Computation and Simulation 66 pp 101– (2000) · Zbl 1054.62004 · doi:10.1080/00949650008812016
[18] Goodman, On the estimation of the number of classes in a population, Annals of Mathematical Statistics 20 pp 572– (1949) · Zbl 0035.09102 · doi:10.1214/aoms/1177729949
[19] Haas, P. J., Stokes, L. (1996) Estimating the number of classes in a finite population, IBM Research Report RJ 10025, IBM Almaden Research Center, San Jose, CA, Revised March 1998
[20] Haas, Estimating the number of classes in a finite population, Journal of the American Statistical Association 93 pp 1475– (1998) · Zbl 1063.62519 · doi:10.1080/01621459.1998.10473807
[21] Haas, An estimator of number of species from quadrat sampling, Biometrics 62 pp 135– (2006) · Zbl 1091.62116 · doi:10.1111/j.1541-0420.2005.00390.x
[22] Hausser, Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks, Journal of Machine Learning Research 10 pp 1469– (2009) · Zbl 1235.62006
[23] Jelinek, Statistical Methods for Speech Recognition (1998)
[24] Johnson, Univariate Discrete Distributions (1992)
[25] Kucera, Computational Analysis of Present-day American English (1967)
[26] Lo, From the species problem to a general coverage problem via a new interpretation, The Annals of Statistics 20 pp 1094– (1992) · Zbl 0778.62028 · doi:10.1214/aos/1176348672
[27] Magurran, Ecological Diversity and Its Measurement (1988) · doi:10.1007/978-94-015-7358-0
[28] McAllester, D., Schapire, R. E. (2000) On the convergence rate of Good-Turing estimators, Proc. 13th Annu. Conference on Comput. Learning Theory, Morgan Kaufmann, San Francisco, CA, pp 1–6
[29] Miller, Documenting completeness, species-area relations, and the species-abundance distribution of a regional flora, Ecology 70 pp 16– (1989) · doi:10.2307/1938408
[30] Mingoti, Estimating the total number of distinct species using presence and absence data, Biometrics 48 pp 863– (1992) · doi:10.2307/2532351
[31] Orlitsky, Always Good Turing: Asymptotically optimal probability estimation, Science 302 pp 427– (2003) · Zbl 1226.01008 · doi:10.1126/science.1088284
[32] Shen, Predicting the number of new species in further taxonomic sampling, Ecology 84 pp 798– (2003) · doi:10.1890/0012-9658(2003)084[0798:PTNONS]2.0.CO;2
[33] Shlosser, On estimation of the size of the dictionary of a long text on the basis of a sample, Engineering Cybernetics 19 pp 97– (1981) · Zbl 0507.62007
[34] Song, Research and Development in Information Retrieval (1999)
[35] Valiant, Proceedings of the forty-third annual ACM symposium on Theory of computing (STOC’11), 685-694 (2011)
[36] Wagner, Strong consistency of the Good-Turing estimator, IEEE Symposium on Information Theory Proceedings pp 2526– (2006)
[37] Zhang, Asymptotic normality of a nonparametric estimator of sample coverage, The Annals of Statistics 37 pp 2582– (2009) · Zbl 1173.62015 · doi:10.1214/08-AOS658