Workflow of Metadata Extraction from Retro-Born Digital Documents
Tkaczyk, Dominika ; Bolikowski, Łukasz
Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011, GDML_Books, (2011), p. 39-44

In this work-in-progress report, we propose a workflow for metadata extraction from articles in digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of each sub-task. We report on the progress of implementation and testing, and describe future work.

EUDML-ID: urn:eudml:doc:221804
Keywords:
@inproceedings{702601,
     author = {Tkaczyk, Dominika and Bolikowski, {\L}ukasz},
     title = {Workflow of Metadata Extraction from Retro-Born Digital Documents},
     booktitle = {Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011},
     series = {GDML\_Books},
     publisher = {Masaryk University Press},
     address = {Brno, Czech Republic},
     year = {2011},
     pages = {39--44},
     url = {http://dml.mathdoc.fr/item/702601}
}
Tkaczyk, Dominika; Bolikowski, Łukasz. Workflow of Metadata Extraction from Retro-Born Digital Documents, in Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011, GDML_Books, (2011), pp. 39-44. http://gdmltest.u-ga.fr/item/702601/

iText, http://itextpdf.com/.

MARG, http://marg.nlm.nih.gov/. | Zbl 1143.68407

PDFBox, http://pdfbox.apache.org/

Nagy, G.; Seth, S.; Viswanathan, M. A prototype document image analysis system for technical journals, Computer 25(7), 10–22 (1992).

O’Gorman, L. The document spectrum for page layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993).

Automating the production of bibliographic records for MEDLINE, Tech. rep. (2001).

Sutton, C.; McCallum, A. An Introduction to Conditional Random Fields for Relational Learning (2006).

Hetzner, E. A simple method for citation metadata extraction using Hidden Markov Models, In: JCDL ’08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. pp. 280–284. ACM, New York, NY, USA (2008).

Marinai, S. Metadata Extraction from PDF Papers for Digital Library Ingest, 10th International Conference on Document Analysis and Recognition. pp. 251–255 (2009).

Sojka, P. An Experience with Building Digital Open Access Repository DML-CZ, In: Proceedings of CASLIN 2009. pp. 74–78 (2009).

Cui, B.; Chen, X. An improved hidden Markov model for literature metadata extraction, Advanced Intelligent Computing Theories and Applications. pp. 205–212 (2010).