Mathematical Document Classification via Symbol Frequency Analysis
Watt, Stephen M.
Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008, GDML_Books, (2008), p. 29-40

Earlier work has examined the frequency of symbol and expression use in mathematical documents for various purposes, including mathematical handwriting recognition and forming the most natural output from computer algebra systems. This work has found, unsurprisingly, that the particulars of symbol and expression use vary from area to area and, in particular, between different top-level subjects of the 2000 Mathematics Subject Classification. If the area of mathematics is known in advance, then area-specific information can be used for the recognition or output problem. What is more interesting is that, although the specifics of which symbols are ranked as most frequent vary from area to area, the shape of the relative frequency curve remains the same. The present work examines the inverse problem: given the relative frequencies of symbols in a document, is it possible to classify the document and determine the most likely area of mathematics of the work? We examine the symbol frequency “fingerprints” for the different areas of the Mathematics Subject Classification.
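
The inverse problem posed in the abstract can be illustrated with a minimal sketch. This is not the method of the paper: the cosine similarity measure, the function names, the area labels, and the fingerprint values below are all hypothetical and chosen only to show the general idea of matching a document's relative symbol frequencies against per-area reference profiles.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(symbols):
    """Map each symbol to its share of all symbol occurrences."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {symbol: count / total for symbol, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(symbol, 0.0) for symbol, value in p.items())
    norm_p = sqrt(sum(value * value for value in p.values()))
    norm_q = sqrt(sum(value * value for value in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def classify(symbols, fingerprints):
    """Return the area whose fingerprint is most similar to the document."""
    doc = relative_frequencies(symbols)
    return max(
        fingerprints,
        key=lambda area: cosine_similarity(doc, fingerprints[area]),
    )


# Hypothetical fingerprints: made-up numbers, not data from the paper.
FINGERPRINTS = {
    "area_A": {"x": 0.5, "=": 0.3, "+": 0.2},
    "area_B": {"int": 0.4, "d": 0.3, "x": 0.3},
}

if __name__ == "__main__":
    print(classify(["x", "=", "x", "+"], FINGERPRINTS))
```

In practice the fingerprints would be estimated from a labeled corpus, and a different distance or a rank-based comparison might suit the data better; the sketch fixes one simple choice so the idea can be run end to end.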

EUDML-ID : urn:eudml:doc:220164
@article{702543,
     author = {Watt, Stephen M.},
     title = {Mathematical Document Classification via~Symbol~Frequency~Analysis},
     booktitle = {Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008},
     series = {GDML\_Books},
     publisher = {Masaryk University},
     address = {Brno},
     year = {2008},
     pages = {29-40},
     zbl = {1170.68494},
     url = {http://dml.mathdoc.fr/item/702543}
}
Watt, Stephen M. Mathematical Document Classification via Symbol Frequency Analysis, in Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008, GDML_Books, (2008), pp. 29-40. http://gdmltest.u-ga.fr/item/702543/
