Transforming large collections of scientific publications to XML. (English) Zbl 1205.68490

Summary: We describe an experiment transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX-to-XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over the collection, in part or entirely, while collecting statistics about missing bindings and other errors. This guides debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and a software test harness. We have now processed the complete arXiv collection through 2006, consisting of more than 400,000 documents (a complete run is a processor-year-size undertaking), continuously improving our success rate. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no or minor warnings. While the remaining 30% can also be converted, their quality is doubtful, due to unsupported macros or conversion errors.
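
The statistics gathering described in the summary can be illustrated with a minimal sketch. It is not the arXMLiv build system: the status labels (`no_problems`, `warning`, `error`, `fatal`), the record layout, and the function name `summarize` are all assumptions made for illustration. It only shows how per-document outcomes could be tallied into the kind of figures the summary reports (share converted, share counted as successes, most frequently missing macros).

```python
from collections import Counter

# Hypothetical outcome labels, chosen for this sketch only.
SUCCESS = {"no_problems", "warning"}   # converted with no or minor warnings
CONVERTED = SUCCESS | {"error"}        # output produced, quality possibly doubtful


def summarize(results):
    """Tally conversion outcomes.

    `results` is an iterable of (doc_id, status, missing_macros) tuples,
    an assumed record layout rather than any real log format.
    """
    status_counts = Counter()
    missing = Counter()
    for _doc_id, status, missing_macros in results:
        status_counts[status] += 1
        missing.update(missing_macros)

    total = sum(status_counts.values())

    def share(statuses):
        # Fraction of all documents whose status is in `statuses`.
        if total == 0:
            return 0.0
        return sum(status_counts[s] for s in statuses) / total

    return {
        "total": total,
        "converted": share(CONVERTED),
        "success": share(SUCCESS),
        "missing": missing,
    }


if __name__ == "__main__":
    sample = [
        ("doc-a", "no_problems", []),
        ("doc-b", "warning", []),
        ("doc-c", "error", ["\\foo"]),
        ("doc-d", "fatal", ["\\foo", "\\bar"]),
    ]
    report = summarize(sample)
    print(report["converted"], report["success"])
    print(report["missing"].most_common(1))
```

Ranking the missing macros by frequency is one plausible way such statistics could direct binding development toward the packages that block the most documents.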


68U15 Computing methodologies for text processing; mathematical typography

