We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then serve as a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near-perfect input for parsing mathematical formulae. Applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.
@article{702603, title = {Towards Reverse Engineering of PDF Documents}, booktitle = {Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011}, series = {GDML\_Books}, publisher = {Masaryk University Press}, address = {Brno, Czech Republic}, year = {2011}, pages = {65-75}, url = {http://dml.mathdoc.fr/item/702603} }
Baker, Josef B.; Sexton, Alan P.; Sorge, Volker. Towards Reverse Engineering of PDF Documents, in Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011, GDML_Books, (2011), pp. 65-75. http://gdmltest.u-ga.fr/item/702603/