Efficient and flexible text extraction from document pages |
| |
Authors: | Pietro Parodi, Roberto Fontana |
| |
Affiliation: | (1) International School for Advanced Studies, Via Beirut 2-4, I-34014 Trieste, Italy; e-mail: parodi@sissa.it |
| |
Abstract: | This paper describes a novel method for extracting text from document pages of mixed content. The method works by detecting
pieces of text lines in small overlapping columns of a given width, shifted with respect to each other by a given number of image elements (the column width is set as a fraction of the image width), and by merging these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires
about 1.3 s for a 300 dpi image on a PC with a 300 MHz Pentium II CPU and an Intel 440LX motherboard. The algorithm is largely independent
of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the
background be uniform and that the text sit approximately horizontally. For a skew of up to about 10 degrees no skew correction
mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington
and its performance has been evaluated by a suitable measure of segmentation accuracy. Also, a detailed analysis of the segmentation
accuracy achieved by the algorithm as a function of noise and skew has been carried out.
Received April 4, 1999 / Revised June 1, 1999 |
| |
Keywords: | Text extraction – Document segmentation – Computational complexity – Segmentation accuracy |
This article is indexed in SpringerLink and other databases. |
|
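
The column-wise detection and bottom-up merging summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the detection or merging criteria, so the sketch assumes a binary image (1 = ink, 0 = background), treats any run of inked rows inside a column as a "piece", and merges pieces greedily when their boxes overlap. The names `find_pieces`, `merge_pieces`, `width`, and `shift` are hypothetical.

```python
def find_pieces(image, width, shift):
    """Return (x0, x1, y0, y1) boxes for runs of inked rows in each column.

    Columns start every `shift` pixels and are `width` pixels wide, so they
    overlap whenever shift < width. This is an assumed, simplified criterion.
    """
    n_rows, n_cols = len(image), len(image[0])
    pieces = []
    for x0 in range(0, n_cols, shift):
        x1 = min(x0 + width, n_cols)
        y_start = None
        # One extra iteration (y == n_rows) closes a run that reaches the bottom.
        for y in range(n_rows + 1):
            inked = y < n_rows and any(image[y][x0:x1])
            if inked and y_start is None:
                y_start = y
            elif not inked and y_start is not None:
                pieces.append((x0, x1, y_start, y))
                y_start = None
    return pieces


def merge_pieces(pieces):
    """Greedily merge pieces into lines, scanning left to right.

    A piece joins an existing line when it touches or overlaps the line's
    right edge and their vertical extents overlap (assumed criterion).
    """
    lines = []
    for x0, x1, y0, y1 in sorted(pieces):
        for line in lines:
            lx0, lx1, ly0, ly1 = line
            if x0 <= lx1 and y0 < ly1 and ly0 < y1:
                line[:] = [min(lx0, x0), max(lx1, x1), min(ly0, y0), max(ly1, y1)]
                break
        else:
            lines.append([x0, x1, y0, y1])
    return [tuple(line) for line in lines]
```

Because pieces are found per column, the merged boxes are only accurate to the column grid; a smaller `shift` gives finer horizontal resolution at the cost of more pieces to merge.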