Using topic models for OCR correction |
| |
Authors: | Faisal Farooq, Anurag Bhardwaj, Venu Govindaraju |
| |
Affiliation: | (1) Hewlett-Packard Laboratories, Bangalore, India |
| |
Abstract: | Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered
a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten
documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art
word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition
accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition
output. In this paper, we present two different methods for this purpose. First, we describe a lexicon-reduction method
that categorizes handwritten documents by topic and uses the result to generate smaller topic-specific lexicons, improving
recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic
categorization model to refine the recognition output. We present the relative merits of each of these methods and report
results on the publicly available IAM database. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|
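The first method in the abstract, topic-based lexicon reduction, can be illustrated with a minimal sketch. This is not the authors' implementation: the topic lexicons are made up, the topic is guessed by simple word overlap rather than by a trained categorization model, and the correction step uses string similarity from Python's standard `difflib` module in place of a handwriting recognizer's scores. All function names (`guess_topic`, `correct_words`, `correct_with_topic`) are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical topic lexicons; a real system would derive these from training text.
TOPIC_LEXICONS = {
    "sports": ["football", "match", "player", "score", "team"],
    "finance": ["bank", "market", "interest", "share", "stock"],
}


def guess_topic(words, lexicons):
    """Pick the topic whose lexicon contains the most recognized words verbatim.

    Stand-in for a trained topic categorizer (assumption, not the paper's model).
    """
    def overlap(topic):
        vocab = set(lexicons[topic])
        return sum(1 for w in words if w in vocab)

    return max(lexicons, key=overlap)


def correct_words(words, lexicon, cutoff=0.6):
    """Replace each word with its most similar lexicon entry, if one is close enough."""
    corrected = []
    for w in words:
        match = get_close_matches(w, lexicon, n=1, cutoff=cutoff)
        # Keep the recognizer's output unchanged when nothing in the lexicon is close.
        corrected.append(match[0] if match else w)
    return corrected


def correct_with_topic(words, lexicons=TOPIC_LEXICONS):
    """Guess the document topic, then correct against that topic's reduced lexicon."""
    topic = guess_topic(words, lexicons)
    return topic, correct_words(words, lexicons[topic])


if __name__ == "__main__":
    noisy = ["team", "scor", "playr", "match"]  # simulated recognizer output
    print(correct_with_topic(noisy))
```

Restricting the candidate set to one topic's lexicon is what keeps the correction step from choosing a similar-looking word from an unrelated domain; the quality of the topic guess therefore bounds the quality of the correction.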