Document cleanup using page frame detection |
| |
Authors: | Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas M. Breuel |
| |
Affiliation: | (1) Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (2) Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany |
| |
Abstract: | When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual
noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle
non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual
noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing
document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document
image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents
area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame
of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the
algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used.
Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the
benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval
as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page
frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based
document image retrieval application decreases
the retrieval error rates by 30%. |
| |
Keywords: | Document analysis; Marginal noise removal; Document pre-processing |
This article is indexed in SpringerLink and other databases. |
|
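The OCR cleanup step described in the abstract (removing characters outside the computed page frame) can be illustrated with a minimal sketch. It assumes the page frame has already been found and is given as an axis-aligned rectangle. The `Box` class, the `filter_to_page_frame` function, the sample coordinates, and the rule that a character is kept when the center of its bounding box lies inside the frame are all hypothetical choices made for illustration. They are not taken from the paper, and the geometric matching that finds the frame is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixels: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains_center_of(self, other: "Box") -> bool:
        # Assumed rule: a character belongs to the page if its box center is inside.
        cx = (other.x0 + other.x1) / 2
        cy = (other.y0 + other.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def filter_to_page_frame(chars, frame):
    """Keep only (text, box) pairs whose box center lies inside the frame."""
    return [(text, box) for text, box in chars if frame.contains_center_of(box)]


# Hypothetical page frame and OCR output for one scanned page.
frame = Box(100, 80, 900, 1200)
ocr_chars = [
    ("T", Box(120, 100, 140, 130)),  # inside the frame
    ("e", Box(20, 400, 40, 430)),    # left of the frame: noise from the neighboring page
    ("x", Box(500, 600, 520, 630)),  # inside the frame
]

kept = filter_to_page_frame(ocr_chars, frame)
print("".join(text for text, _ in kept))  # prints "Tx"
```

A center-based test is only one possible design. A stricter variant could require the whole character box to lie inside the frame, at the cost of dropping characters that straddle the frame boundary.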