Similar literature
Found 20 similar articles (search time: 31 ms)
1.
When using OCR equipment (OCR = Optical Character Recognition) it is necessary to define shape and print quality of the characters in national and international standards. The paper gives a review of the standardization of the character fonts OCR A and OCR B for optical character recognition. All relevant parameters are discussed which define the print quality of printed characters. A critical review of two measurement methods used so far for the evaluation of print quality parameters is given. The improvement of the measurement methods led to the development of an automatic measurement device which is described in detail. The automatic measurement device consists of a high precision scanning device to digitize the printed characters and a computer program for the evaluation of the data. The mechanical construction and the properties of the scanning device are explained. The computer program consists of two main parts for preprocessing of the scanned data and for the evaluation of the different print parameters. Test results are reported and possible applications of the measurement device are discussed.

2.

Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, poor quality of scanned documents and limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and cleaning the errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach to OCR post-processing error correction using correction pattern edits and an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. Through efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches as evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.


3.
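Chinese characters come in a rich variety of typefaces, and different typefaces differ markedly in character structure. Current OCR technology focuses on recognizing characters and pays little attention to recognizing fonts. This paper proposes a text-dependent, single-character font recognition method that uses text-dependent prior information and structural font features, measures font similarity with a vector space model, and achieves a good average recognition rate in experiments on 66 commonly used Simplified Chinese fonts.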
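
As a rough illustration of similarity measurement in a vector space model, the sketch below compares a query feature vector against one reference vector per font using cosine similarity. The feature values, the font names, and the choice of cosine similarity itself are assumptions made for illustration; the abstract does not say which features or which similarity measure the authors use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def classify_font(query, references):
    """Return the font whose reference vector is most similar to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))


# Hypothetical structural features for one known character in two fonts.
references = {
    "SongTi": [0.9, 0.1, 0.3],
    "HeiTi": [0.2, 0.8, 0.5],
}
best = classify_font([0.85, 0.15, 0.25], references)
```

Because the method is text-dependent, a real system would presumably keep a separate set of reference vectors for each character; the single dictionary here stands in for one such set.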

4.
Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents, such as those with moderate degradation or unusual fonts, continue to break modern OCR software. Many approaches rely on pre-computed or stored character models, but these are vulnerable to cases when the font of a particular document was not part of the training set or when there is so much noise in a document that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek.

5.
In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
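
The difference between 1-best lookup and lattice search can be sketched with a simplified structure. The example below assumes a confusion network (one list of scored alternatives per word position) in place of a full lattice, and uses the product of per-word scores as a confidence measure. Both choices, along with all names and scores, are illustrative assumptions and not the authors' algorithm.

```python
def one_best(network):
    """Top-scoring alternative at every position (the 1-best OCR output)."""
    return [max(slot, key=lambda alt: alt[1])[0] for slot in network]


def find_entity(network, entity_tokens, min_confidence=0.0):
    """Return (start, confidence) for every position where the entity can be
    read off some path through the network with enough confidence."""
    hits = []
    n = len(entity_tokens)
    for start in range(len(network) - n + 1):
        confidence = 1.0
        for offset, token in enumerate(entity_tokens):
            scores = [score for word, score in network[start + offset] if word == token]
            if not scores:
                confidence = 0.0
                break
            confidence *= max(scores)
        if confidence > 0.0 and confidence >= min_confidence:
            hits.append((start, confidence))
    return hits


# Hypothetical recognizer output: alternatives with posterior-like scores.
network = [
    [("visit", 0.9), ("vislt", 0.1)],
    [("Bagdad", 0.6), ("Baghdad", 0.4)],
    [("today", 0.8), ("toady", 0.2)],
]
```

Here the entity "Baghdad" is missing from the 1-best output but is found by the network search. Raising `min_confidence` trades that extra recall for fewer false alarms, which mirrors the trade-off the abstract describes.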

6.
An integrated OCR system for Farsi text is proposed. The system uses information from several knowledge sources (KSs) and manages them in a blackboard approach. Some KSs, such as classifiers, are acquired a priori through an offline training process, while others, such as statistical features, are extracted online during recognition. An arbiter controls the interactions between the solution blackboard and the KSs. The system has been tested on 20 real-life scanned documents with ten popular Farsi fonts, and recognition rates of 97.05% at the word level and 99.03% at the character level have been achieved. An erratum to this article can be found at

7.
8.
Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In this study a neural network approach is introduced to perform high accuracy recognition on multi-size and multi-font characters; a novel centroid-dithering training process with a low noise-sensitivity normalization procedure is used to achieve high accuracy results. The study consists of two parts. The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts. The performance of these two networks is evaluated based on a database of more than one million character images from the testing data set.

9.
10.
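Implementation of a robust multi-font printed English recognition system   Cited by: 6 (self-citations: 1, citations by others: 5)
This article discusses the main problems solved in designing a practical multi-font English recognition system. The system can recognize as many as 260 fonts, including italic and bold styles. Its recognition rate on the training set reaches 99%, and its error rate on real text is 56% lower than that of TH-OCR2000. The article describes in detail the common techniques used for text line and character segmentation, feature extraction, classifier design, and post-processing, analyzes and compares the characteristics of these techniques, and proposes some new ones. The article offers some guidance for the design of OCR systems.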

11.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and then scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
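
The final step, carrying groundtruth from the ideal image over to the scanned image, can be sketched as follows. The sketch assumes the estimated global transformation is affine and that groundtruth is stored as axis-aligned character boxes. The abstract specifies neither, and the matrix values below are invented for illustration.

```python
def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to an (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def transform_box(matrix, box):
    """Map a box (x0, y0, x1, y1) through the affine transform and return the
    axis-aligned box that encloses its four transformed corners."""
    x0, y0, x1, y1 = box
    corners = [apply_affine(matrix, p) for p in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical estimate: the scan is scaled by 2 and shifted by (10, 5).
matrix = ((2.0, 0.0, 10.0), (0.0, 2.0, 5.0))
ideal_groundtruth = [("A", (0, 0, 10, 20))]
scanned_groundtruth = [(ch, transform_box(matrix, box)) for ch, box in ideal_groundtruth]
```

Taking the enclosing box of the four corners keeps the result axis-aligned even when the transform includes a small rotation, at the cost of slightly loose boxes.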

12.
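When optical character recognition is performed on document images, the recognition rate drops if the book page is warped. A fast correction method is proposed for warped document images that contain header and footer lines. The header lines in the image are first detected and located, and their coordinates are saved. A proportional algorithm then computes how far each point on the header line must move up or down to be corrected. With this distance as a parameter, the image is scanned, the distance that each target pixel between the header and footer lines must move is computed, and the pixels are moved at the same time to reconstruct the image, which yields the corrected image. Experimental results show that the method corrects warped document images containing header and footer lines effectively and that the OCR recognition rate improves substantially after correction.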
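
One plausible reading of the proportional step is a per-column linear mapping: within each pixel column, rows are moved so that the detected header and footer rows land on straight target rows, and rows in between move proportionally. The function below is only a sketch of that reading. The exact formula and all parameter names are assumptions, since the abstract does not give the algorithm.

```python
def corrected_row(y, header_y, footer_y, target_header_y, target_footer_y):
    """Linearly map row y of one pixel column so that the detected header and
    footer rows land on the straight target rows."""
    if footer_y == header_y:
        raise ValueError("header and footer rows must differ in this column")
    ratio = (y - header_y) / (footer_y - header_y)
    return target_header_y + ratio * (target_footer_y - target_header_y)


# Hypothetical column where the warped lines sit 2 pixels too low.
new_y = corrected_row(62, header_y=12, footer_y=112,
                      target_header_y=10, target_footer_y=110)
```

Applying this mapping column by column, with `header_y` and `footer_y` taken from the detected curves at that column, would flatten both lines and shift the text between them accordingly.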

13.
14.
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3 to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
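
The cleanup step used in the OCR experiment, dropping recognized characters that fall outside the page frame, can be sketched as below. The box format, the test on box centers, and the sample coordinates are assumptions for illustration; the page frame itself would come from the geometric matching algorithm, which is not reproduced here.

```python
def inside_frame(frame, box):
    """True if the center of a character box lies within the page frame.
    Both frame and box are (x0, y0, x1, y1) in image coordinates."""
    fx0, fy0, fx1, fy1 = frame
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return fx0 <= cx <= fx1 and fy0 <= cy <= fy1


def clean_ocr(frame, chars):
    """Keep only the (character, box) pairs located inside the page frame."""
    return [(ch, box) for ch, box in chars if inside_frame(frame, box)]


# Hypothetical page frame and OCR output; "x" comes from the neighboring page.
frame = (100, 50, 500, 700)
chars = [("T", (120, 60, 130, 80)), ("x", (20, 60, 30, 80))]
cleaned = clean_ocr(frame, chars)
```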

15.
International Journal on Document Analysis and Recognition (IJDAR) - Optical character recognition (OCR) is the process of recognizing characters automatically from scanned documents for editing,...

16.
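Design and implementation of a vector font library   Cited by: 1 (self-citations: 0, citations by others: 1)
A vector font library is the main way of using vector text. Building on a method for vectorizing the bitmaps of outline characters, this paper describes in detail the design and implementation of each stage: capturing character bitmap data, extracting bitmap contours, generating vectorized data, and generating, reading, and displaying the vector font library.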

17.
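This paper explores a computer-based fuzzy recognition method that identifies the printer model from the text images of a printed document. Text in standard, commonly used sizes and typefaces is printed on common printers and scanned. The images are processed with an improved histogram waveform analysis to extract feature indices such as the total stroke area and the total stroke contour perimeter of each character. One printer model is then chosen as the reference. For the same characters on each model, the measured values of these indices, together with values computed from several combinations of them, form sequences of relative-difference indices that are stored in a database. On this basis, a range table of the fluctuation intervals of the statistical mean of each index is built, the weight of each index is determined, and a weight coefficient matrix is constructed. To identify an unknown model, any 100 common characters are measured as described above; the characters are recognized automatically with an OCR Chinese character recognition module, the indices are computed, and the fuzzy recognition process begins. A fuzzy relation matrix is built from the probability that each tested character falls within the intervals of the range table. A fuzzy transformation, the product of the two matrices, yields the decision matrix, and the printer type is determined by maximum membership. An intelligent printer identification program was designed and implemented according to this mathematical model. Tests on practical examples show that the decisions are accurate and meet design expectations.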
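
The last two steps, the fuzzy transformation and the maximum-membership decision, can be sketched as a weighted combination followed by an arg-max. The example assumes the weighted-sum form of fuzzy composition (other operators such as max-min are also common), and every number and name in it is made up for illustration.

```python
def fuzzy_decision(weights, relation):
    """Combine index weights (length m) with a fuzzy relation matrix (m x n)
    into one membership score per candidate printer model (length n)."""
    n_models = len(relation[0])
    return [sum(w * row[j] for w, row in zip(weights, relation))
            for j in range(n_models)]


def max_membership(scores):
    """Index of the candidate with the largest membership score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical weights for two indices (stroke area, contour perimeter).
weights = [0.5, 0.5]
# relation[i][j]: share of test characters whose index i falls inside the
# value range recorded for printer model j.
relation = [
    [0.8, 0.2],
    [0.6, 0.4],
]
scores = fuzzy_decision(weights, relation)
best_model = max_membership(scores)
```

With these numbers the first model receives the larger score, so `best_model` is 0.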

18.
In this paper, we present a system that automatically translates Arabic text embedded in images into English. The system consists of three components: text detection from images, character recognition, and machine translation. We formulate the text detection as a binary classification problem and apply gradient boosting tree (GBT), support vector machine (SVM), and location-based prior knowledge to improve the F1 score of text detection from 78.95% to 87.05%. The detected text images are processed by off-the-shelf optical character recognition (OCR) software. We employ an error correction model to post-process the noisy OCR output, and apply a bigram language model to reduce word segmentation errors. The translation module is tailored with a compact data structure for hand-held devices. The experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in the experiment with the Arabic Transparent font, the BLEU score increases from 18.70 to 33.47 with the use of the error correction module.

19.
An omnifont open-vocabulary OCR system for English and Arabic   Cited by: 2 (self-citations: 0, citations by others: 2)
We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on hidden Markov models (HMM), an approach that has proven to be very successful in the area of automatic speech recognition. We focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
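
A trigram language model on character sequences can be illustrated with simple counts. The sketch below pads the text with start and end markers and uses add-one smoothing. Both are assumptions chosen for brevity, and the HMM recognizer that such a model would plug into is not shown.

```python
from collections import Counter


def train_char_trigrams(text):
    """Count character trigrams and their two-character contexts.
    '^' pads the start and '$' marks the end of the text."""
    padded = "^^" + text + "$"
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts, set(padded)


def trigram_prob(model, context, char):
    """P(char | two-character context) with add-one smoothing."""
    trigrams, contexts, alphabet = model
    return (trigrams[context + char] + 1) / (contexts[context] + len(alphabet))


model = train_char_trigrams("banana")
p_seen = trigram_prob(model, "an", "a")    # "ana" occurs twice in the text
p_unseen = trigram_prob(model, "an", "b")  # "anb" never occurs
```

During recognition, scores of this kind would favor character sequences that look like the training language, which is what lets a system handle words outside any fixed vocabulary.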

20.
Because neural networks specialize in handling ambiguous data, they are especially suited for such applications as speech recognition and optical character recognition (OCR). OCR applications are usually ambiguous because their data is generated by an inconsistent factor: the individual. This article provides an overview of neural networks and describes how this technology can be integrated with OCR technology to create neural OCR networks that can significantly improve the process of optical character recognition.
