Tilahun, Gelila; Feuerverger, Andrey; Gervers, Michael
Dating medieval English charters. (English) Zbl 1257.62004
Ann. Appl. Stat. 6, No. 4, 1615-1640 (2012).

Summary: Deeds, or charters, dealing with property rights, provide a continuous documentation which can be used by historians to study the evolution of social, economic and political changes. This study is concerned with charters (written in Latin) dating from the tenth through early fourteenth centuries in England. Of these, at least one million were left undated, largely due to administrative changes introduced by William the Conqueror in 1066. Correctly dating such charters is of vital importance in the study of English medieval history. This paper is concerned with computer-automated statistical methods for dating such document collections, with the goal of reducing the considerable efforts required to date them manually and of improving the accuracy of assigned dates. The proposed methods are based on such data as the variation over time of word and phrase usage, and on measures of distance between documents. The extensive (and dated) Documents of Early England Data Set (DEEDS) maintained at the University of Toronto was used for this purpose.

Cited in 2 Documents

MSC:
62-07 Data analysis (statistics) (MSC2010)
91F10 History, political science
62P99 Applications of statistics
68U99 Computing methodologies and applications
65C60 Computational problems in statistics (MSC2010)

Keywords: bandwidth selection; cross-validation; medieval charters; DEEDS data set; generalized linear models; kernel smoothing; local log-likelihood; maximum prevalence method; nearest neighbor methods (kNN); quantile regression; text mining

Software: ElemStatLearn; KernSmooth
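
Illustration: the summary says the dating methods draw on measures of distance between documents, and the keywords mention kernel smoothing. The sketch below shows one way such an idea could be arranged; it is not the authors' procedure. The choice of word-pair shingles, the Jaccard distance, the Gaussian kernel, the default bandwidth, and all function names are assumptions made only for this example.

```python
import math


def shingles(text, k=2):
    """Return the set of k-word shingles (runs of k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; taken to be 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def estimate_date(undated_text, dated_corpus, bandwidth=0.5, k=2):
    """Kernel-weighted mean of the known dates.

    dated_corpus is an iterable of (text, year) pairs. Documents closer
    to the undated text (smaller shingle distance) get larger weights.
    The bandwidth is a free parameter; in practice it would need tuning,
    for example by cross-validation on the dated documents.
    """
    target = shingles(undated_text, k)
    num = 0.0
    den = 0.0
    for text, year in dated_corpus:
        d = jaccard_distance(target, shingles(text, k))
        w = math.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian kernel weight
        num += w * year
        den += w
    if den == 0.0:
        raise ValueError("no usable weights: empty corpus or bandwidth too small")
    return num / den
```

Calling `estimate_date(text, corpus)` with a list of `(text, year)` pairs returns a weighted average year. A weighted mean is only the simplest summary; a nearest-neighbor variant would restrict the sum to the few closest dated documents.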