Old printed documents are an important part of our cultural heritage, and their digitization plays an important role in creating data and metadata. This paper proposes an algorithm for estimating the global text skew. First, the document image is binarized, which reduces the impact of noise and uneven illumination. The binary image is then statistically analyzed and processed so that redundant data are excluded. Next, a convex hull is established around each text object, and the hulls are joined to form connected components. The connected components in the complementary image are then enlarged by morphological dilation. Finally, the largest connected component is extracted. Its orientation, calculated from its moments, approximates the global orientation of the text document. The efficiency and correctness of the algorithm are verified by testing on a custom dataset.
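The abstract does not give the formula used for the final step, so the following is only a minimal sketch of one common way to obtain an orientation from moments: the standard second-order central moment relation, theta = 0.5 * atan2(2*mu11, mu20 - mu02). The function names `orientation_from_moments` and `make_line_image` are hypothetical, and the input is assumed to be a binary image, stored as a list of rows, that already contains only the largest connected component.

```python
import math


def orientation_from_moments(binary):
    """Return the orientation (in radians) of the foreground pixels.

    Assumption: `binary` is a list of rows whose nonzero entries belong
    to the single component of interest. x is the column index and y is
    the row index (pointing down), so a positive angle means the
    component descends from left to right on screen.
    """
    points = [(x, y)
              for y, row in enumerate(binary)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("no foreground pixels")

    n = len(points)
    # Centroid of the foreground pixels.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Second-order central moments.
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)

    # Angle of the principal axis of the pixel distribution.
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)


def make_line_image(width, height, slope):
    """Build a hypothetical test image containing one straight line."""
    image = [[0] * width for _ in range(height)]
    for x in range(width):
        y = int(round(height / 2 + slope * (x - width / 2)))
        if 0 <= y < height:
            image[y][x] = 1
    return image


if __name__ == "__main__":
    skewed = make_line_image(21, 21, 1.0)
    print(math.degrees(orientation_from_moments(skewed)))
```

In a full pipeline the input would come from the earlier steps (binarization, convex hulls, dilation, extraction of the largest component); the synthetic line here only stands in for that component.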
Published: 2015-10-19
Classification:
Other areas of Computing and Informatics
Document image analysis, image analysis, moments, optical character recognition, statistical analysis, text skew
68U10; 62H35; 65D18; 46N30
@article{cai1492,
author = {Darko Brodi\'{c} and \v{C}edomir A. Maluckov and Liangrui Peng},
title = {Statistics Oriented Preprocessing of Document Image},
journal = {Computing and Informatics},
volume = {33},
number = {3},
year = {2015},
language = {en},
url = {http://dml.mathdoc.fr/item/cai1492}
}
Darko Brodić (University of Belgrade, Technical Faculty in Bor, 19210 Bor); Čedomir A. Maluckov (University of Belgrade, Technical Faculty in Bor, 19210 Bor); Liangrui Peng (Tsinghua University, Department of Electronic Engineering, Beijing 100084). Statistics Oriented Preprocessing of Document Image. Computing and Informatics, Volume 33 (2015), no. 3. http://gdmltest.u-ga.fr/item/cai1492/