
Mathematical document classification via symbol frequency analysis. (English) Zbl 1170.68494

Sojka, Petr (ed.), DML 2008. Towards digital mathematics library, Birmingham, UK, July 27th, 2008. Proceedings. Brno: Masaryk University (ISBN 978-80-210-4658-0/pbk). 29-40 (2008).
Summary: Earlier work has examined the frequency of symbol and expression use in mathematical documents for various purposes, including mathematical handwriting recognition and forming the most natural output from computer algebra systems.
This work has found, unsurprisingly, that the particulars of symbol and expression use vary from area to area and, in particular, between different top-level subjects of the 2000 Mathematical Subject Classification. If the area of mathematics is known in advance, then area-specific information can be used for the recognition or output problem. What is more interesting is that although the specifics of which symbols are ranked as most frequent vary from area to area, the shape of the relative frequency curve remains the same.
The present work examines the inverse problem: given the relative frequencies of symbols in a document, is it possible to classify the document and determine the most likely area of mathematics of the work? We examine the symbol frequency "fingerprints" for the different areas of the Mathematical Subject Classification.
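To make the inverse problem concrete, the following is a minimal sketch of one way a document could be matched against per-area symbol frequency fingerprints. It is not taken from the paper: the nearest-fingerprint rule, the use of cosine similarity, the function names, and the toy fingerprints (`area_A`, `area_B`) are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def relative_frequencies(symbols):
    """Map each symbol to its share of all symbol occurrences."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {s: c / total for s, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse frequency vectors stored as dicts."""
    dot = sum(v * q.get(s, 0.0) for s, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def classify(symbols, fingerprints):
    """Return the area whose fingerprint is most similar to the document."""
    doc = relative_frequencies(symbols)
    return max(
        fingerprints,
        key=lambda area: cosine_similarity(doc, fingerprints[area]),
    )


# Hypothetical fingerprints built from made-up symbol lists; real ones
# would be estimated from a corpus of documents with known classification.
FINGERPRINTS = {
    "area_A": relative_frequencies(["x", "x", "y", "+", "=", "∫", "∫", "d"]),
    "area_B": relative_frequencies(["G", "g", "∘", "=", "∈", "G", "e", "g"]),
}

predicted = classify(["∫", "x", "d", "x", "="], FINGERPRINTS)
```

Here `predicted` is the label of the toy fingerprint closest to the sample document. Any other distance between frequency distributions could be substituted for cosine similarity without changing the overall shape of the sketch.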
For the entire collection see [Zbl 1158.00018].

MSC:

68P99 Theory of data
Full Text: EuDML