Extracting precise data on the mathematical content of PDF documents. (English) Zbl 1170.68481
Sojka, Petr (ed.), DML 2008. Towards digital mathematics library, Birmingham, UK, July 27th, 2008. Proceedings. Brno: Masaryk University (ISBN 978-80-210-4658-0/pbk). 75-79 (2008).
Summary: As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognition techniques. The data can then be used to improve mathematical parsing methods that transform the mathematics into richer formats such as MathML.
