Part-of-math tagging and applications. (English) Zbl 1367.68313

Geuvers, Herman (ed.) et al., Intelligent computer mathematics. 10th international conference, CICM 2017, Edinburgh, UK, July 17–21, 2017. Proceedings. Cham: Springer (ISBN 978-3-319-62074-9/pbk; 978-3-319-62075-6/ebook). Lecture Notes in Computer Science 10383. Lecture Notes in Artificial Intelligence, 356-374 (2017).
Summary: Nearly all of the recent mathematical literature, and much of the old literature, are online and mostly in natural-language form. Therefore, math content processing presents some of the same challenges faced in natural language processing (NLP), such as math disambiguation and math semantics determination. These challenges must be surmounted to enable more effective math knowledge management, math knowledge discovery, automated presentation-to-computation (P2C) conversion, and automated math reasoning. To meet this goal, considerable math language processing (MLP) technology is needed.

This project aims to advance MLP by developing (1) a sophisticated part-of-math (POM) tagger, (2) math-sense disambiguation techniques along with supporting machine-learning (ML) based MLP algorithms, and (3) semantics extraction from, and enrichment of, math expressions. Specifically, the project first created an evolving tagset for math terms and expressions, and is developing a general-purpose POM tagger. The tagger works in several scans and interacts with other MLP algorithms that will be developed in this project. In the first scan of an input math document, each math term and some sub-expressions are tagged with two kinds of tags. The \(1^{\mathrm{st}}\) kind consists of definite tags (such as operation, relation, numerator, etc.) that the tagger is certain of. The \(2^{\mathrm{nd}}\) kind consists of alternative, tentative features (including alternative roles and meanings) drawn from a knowledge base that has been developed for this project. The \(2^{\mathrm{nd}}\) and \(3^{\mathrm{rd}}\) scans will, in conjunction with some NLP/ML-based algorithms, select the right features from among those alternative features, disambiguate the terms, group subsequences of terms into unambiguous sub-expressions and tag them, and thus derive definite unambiguous semantics of math terms and expressions.
The NLP/ML-based algorithms needed for this work will be another part of this project. These include math topic modeling, math context modeling, math document classification (into various standard areas of math), and definition-harvesting algorithms.

The project will create significant new concepts and techniques that will advance knowledge in two respects. First, the tagger, math disambiguation techniques, and NLP/ML-based algorithms, though they correspond to NLP and ML counterparts, will be quite novel because math expressions are radically different from natural language. Second, the project outcomes will enable the development of new advanced applications such as: (1) techniques for computer-aided semantic enrichment of digital math libraries; (2) automated P2C conversion of math expressions from natural form to (i) a machine-computable form and (ii) a formal form suitable for automated reasoning; (3) math question-answering capabilities at the manuscript level and collection level; (4) richer math UIs; and (5) more accurate math optical character recognition.
For the entire collection see [Zbl 1364.68010].
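
Illustration (not part of the original record): the multi-scan idea in the summary can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' tagger. The tag names, the tiny knowledge base, the context cue words, and the function names `first_scan` and `later_scan` are all hypothetical, and simple cue-word overlap stands in for the NLP/ML-based context models that the summary mentions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy knowledge base (an assumption, not the paper's data).
# Tokens whose role is unambiguous get definite tags.
DEFINITE_TAGS = {
    "+": ["operation"],
    "^": ["operation"],
    "=": ["relation"],
}

# Ambiguous tokens get alternative features, each with made-up cue words
# that suggest the feature when they occur in the surrounding text.
TENTATIVE_FEATURES = {
    "i": {
        "imaginary-unit": {"complex", "real", "imaginary"},
        "index": {"sum", "sequence", "matrix"},
    },
    "e": {
        "euler-constant": {"exponential", "logarithm", "complex"},
        "variable": {"edge", "graph"},
    },
}


@dataclass
class TaggedTerm:
    token: str
    definite: list = field(default_factory=list)   # tags the tagger is sure of
    tentative: list = field(default_factory=list)  # alternatives still open
    resolved: Optional[str] = None                 # filled in by a later scan


def first_scan(tokens):
    """Attach definite tags and tentative alternatives to every token."""
    return [
        TaggedTerm(
            token=t,
            definite=list(DEFINITE_TAGS.get(t, [])),
            tentative=sorted(TENTATIVE_FEATURES.get(t, {})),
        )
        for t in tokens
    ]


def later_scan(terms, context_words):
    """Pick one alternative per ambiguous term by cue-word overlap.

    A term stays unresolved when no alternative matches the context
    or when several alternatives tie.
    """
    context = set(context_words)
    for term in terms:
        if not term.tentative:
            continue
        cues = TENTATIVE_FEATURES[term.token]
        scores = {f: len(cues[f] & context) for f in term.tentative}
        best = max(scores.values())
        winners = [f for f, s in scores.items() if s == best]
        if best > 0 and len(winners) == 1:
            term.resolved = winners[0]
    return terms


if __name__ == "__main__":
    tagged = later_scan(first_scan(["e", "^", "i", "=", "x"]),
                        ["complex", "exponential"])
    for term in tagged:
        print(term)
```

Keeping the definite tags separate from the tentative alternatives mirrors the two kinds of tags described in the summary: a later scan only has to choose among alternatives and never overwrites a tag the first scan was certain of.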

MSC:

68T30 Knowledge representation
68T50 Natural language processing
