Automated classification and categorization of mathematical knowledge. (English) Zbl 1166.68358

Autexier, Serge (ed.) et al., Intelligent computer mathematics. 9th international conference, AISC 2008, 15th symposium, Calculemus 2008, 7th international conference, MKM 2008, Birmingham, UK, July 28–August 1, 2008. Proceedings. Berlin: Springer (ISBN 978-3-540-85109-7/pbk). Lecture Notes in Computer Science 5144. Lecture Notes in Artificial Intelligence, 543-557 (2008).
Summary: The Mathematics Subject Classification (MSC) is a common system for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on the task of classifying papers into top-level MSC categories exceeds 89%. We describe and evaluate our methods for measuring the similarity of papers in the digital library based on their full texts.
For the entire collection see [Zbl 1154.68002].
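
The F1-measure cited in the summary is the harmonic mean of precision and recall. The following is a minimal sketch of that definition for a single category, assuming raw counts of true positives, false positives and false negatives. The function name and the sample counts are illustrative and not taken from the paper, and the summary does not state whether the reported figure is micro- or macro-averaged.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are zero or undefined,
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted labels that are correct
    recall = tp / (tp + fn)     # share of true labels that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one MSC category (not data from the paper).
example = f1_score(tp=90, fp=10, fn=12)
```

With these hypothetical counts, precision is 0.9 and recall is 90/102, so the score equals 180/202, which is about 0.891.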


68T30 Knowledge representation
00A35 Methodology of mathematics
68T05 Learning and adaptive systems in artificial intelligence



