Miner, Robert; Munavalli, Rajesh An approach to mathematical search through query formulation and data normalization. (English) Zbl 1202.68130 Kauers, Manuel (ed.) et al., Towards mechanized mathematical assistants. 14th symposium, Calculemus 2007, 6th international conference, MKM 2007, Hagenberg, Austria, June 27–30, 2007. Proceedings. Berlin: Springer (ISBN 978-3-540-73083-5/pbk). Lecture Notes in Computer Science 4573. Lecture Notes in Artificial Intelligence, 342-355 (2007). Summary: This article describes an approach to searching for mathematical notation. The approach aims at a search system that can be effectively and economically deployed, and that produces good results with a large portion of the mathematical content freely available on the World Wide Web today. The basic concept is to linearize mathematical notation as a sequence of text tokens, which are then indexed by a traditional text search engine. However, a naive generalization of the “phrase query” of text search to mathematical expressions performs poorly. For adequate precision and recall in the mathematical context, more complex combinations of atomic queries are required. Our approach is to query for a weighted collection of significant subexpressions, where weights depend on expression complexity, nesting depth, expression length, and special boosting of well-known expressions. To make this approach perform well with the technical content that is readily obtainable on the World Wide Web, either directly or through conversion, it is necessary to extensively normalize mathematical expression data to eliminate accidental or irrelevant encoding differences. To do this, a multi-pass normalization process is applied. In successive stages, MathML and XML errors are corrected, character data is canonicalized, white space and other insignificant data are removed, and heuristics are applied to disambiguate expressions.
Following these preliminary stages, the MathML tree structure is canonicalized via an augmented precedence parsing step. Finally, mathematical synonyms and some variable names are canonicalized. For the entire collection see [Zbl 1119.68011]. Cited in 6 Documents MSC: 68P10 Searching and sorting 68M11 Internet topics 68P20 Information storage and retrieval of data
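
As an illustration of the idea in the summary, linearizing an expression into text tokens and then querying for a weighted collection of its subexpressions might be sketched as follows. This is a minimal sketch and not the authors' implementation: the nested-tuple expression format, the prefix token order, the function names, and the weight formula (subexpression size divided by one plus nesting depth) are all assumptions made for illustration. The summary states only that weights depend on complexity, nesting depth, length, and boosting of well-known expressions.

```python
from typing import Iterator, List, Tuple, Union

# Hypothetical expression format: a leaf is a string token,
# a compound node is a tuple (operator, operand, operand, ...).
Expr = Union[str, tuple]


def linearize(expr: Expr) -> List[str]:
    """Flatten an expression tree into a prefix-order token sequence."""
    if isinstance(expr, str):
        return [expr]
    op, *args = expr
    tokens = [op, "("]
    for arg in args:
        tokens.extend(linearize(arg))
    tokens.append(")")
    return tokens


def subexpressions(expr: Expr, depth: int = 0) -> Iterator[Tuple[tuple, int]]:
    """Yield each compound subexpression together with its nesting depth."""
    if isinstance(expr, str):
        return
    yield expr, depth
    for arg in expr[1:]:
        yield from subexpressions(arg, depth + 1)


def size(expr: Expr) -> int:
    """Count the nodes of an expression as a crude complexity measure."""
    if isinstance(expr, str):
        return 1
    return 1 + sum(size(arg) for arg in expr[1:])


def weighted_queries(expr: Expr) -> List[Tuple[str, float]]:
    """Build (phrase, weight) pairs, one per compound subexpression.

    The weight formula is an assumed placeholder: larger subexpressions
    score higher, deeper nesting scores lower.
    """
    queries = []
    for sub, depth in subexpressions(expr):
        weight = size(sub) / (1 + depth)
        queries.append((" ".join(linearize(sub)), weight))
    return queries


if __name__ == "__main__":
    # x^2 + 1 in the assumed tuple format
    for phrase, weight in weighted_queries(("plus", ("power", "x", "2"), "1")):
        print(f"{weight:.2f}  {phrase}")
```

Each phrase could then be submitted to a conventional text search engine as a boosted phrase query, so that a document matching only an inner subexpression still ranks, but below one matching the whole expression.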