Similar literature
 Found 20 similar documents; search took 15 ms
1.
2.
3.
International Journal on Document Analysis and Recognition (IJDAR) - This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual...

4.
This paper presents an approach called the box method for extracting features for handwritten character recognition. In this method, the binary image of the character is partitioned into a fixed number of subimages called boxes. The features consist of the vector distance (γ) from each box to a fixed point. To find γ, the vector distances of all the pixels lying in a particular box from the fixed point are calculated, added up, and normalized by the number of pixels within that box. Both neural networks and fuzzy logic techniques are used for recognition; recognition rates are around 97 percent using neural networks and 98 percent using fuzzy logic. The methods are independent of font and size and, with minor changes in preprocessing, can be adapted to any language.
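The box feature described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that only foreground (nonzero) pixels contribute, that "vector distance" means Euclidean distance, that the fixed point is the image origin, and that an empty box yields 0. The function name `box_features` and its parameters are hypothetical.

```python
import math

def box_features(image, n_rows, n_cols, origin=(0, 0)):
    """Return one mean-distance feature per box, in row-major box order.

    image  : list of rows of 0/1 values (binary character image)
    origin : assumed fixed point (x, y) from which distances are measured
    """
    height, width = len(image), len(image[0])
    sums = [0.0] * (n_rows * n_cols)
    counts = [0] * (n_rows * n_cols)
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                # Map the pixel to its box in an n_rows x n_cols grid.
                box = (y * n_rows // height) * n_cols + (x * n_cols // width)
                sums[box] += math.hypot(x - origin[0], y - origin[1])
                counts[box] += 1
    # Normalize each sum by the number of contributing pixels in the box.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

image = [
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
features = box_features(image, n_rows=2, n_cols=2)
```

The resulting feature vector has a fixed length (one value per box) regardless of the character's size, which is what makes it usable as input to a neural network or fuzzy classifier.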

5.
6.
Tuan D. Pham. Pattern Recognition, 2003, 36(12): 3023-3025
A fast and effective algorithm is developed for detecting logos in grayscale document images. The computational schemes involve segmentation, and the calculation of the spatial density of the defined foreground pixels. The detection does not require training and is unconstrained in the sense that the presence of a logo in a document image can be detected under scaling, rotation, translation, and noise. Several tests on different electronic document forms such as letters, faxes, and billing statements are carried out to illustrate the performance of the method.

7.
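A recognition method for unconstrained handwritten digits based on combined classifiers   Total citations: 1, self-citations: 1, citations by others: 0
Unconstrained handwritten digit recognition is widely used in data entry and text recognition. Recognizing handwritten digits with a combination of classifiers overcomes the limitations of single-factor recognition. The system uses a distance-based method and an improved BP neural network, takes multiple feature vectors as classifier inputs, and determines the recognition output by a voting (show-of-hands) rule. Experiments show that the system achieves a high recognition rate and a very low misrecognition rate, and has encouraging application value.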
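A voting rule for combining classifier outputs might look like the sketch below. It is an assumption-laden illustration: the abstract does not specify the exact rule, so this version uses a strict-majority vote and rejects the sample (returns `None`) when no label has enough support, which is one common way to trade recognition rate for a lower misrecognition rate. The name `vote` and the `min_votes` parameter are hypothetical.

```python
from collections import Counter

def vote(predictions, min_votes=None):
    """Return the majority label, or None (reject) if support is too weak.

    predictions : labels proposed by the individual classifiers
    min_votes   : assumed acceptance threshold; defaults to a strict majority
    """
    if not predictions:
        return None
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= min_votes else None

accepted = vote([3, 3, 8])   # two of three classifiers agree
rejected = vote([3, 8, 5])   # no agreement, so the digit is rejected
```

Rejecting ambiguous samples rather than guessing is consistent with the abstract's emphasis on a very low misrecognition rate, though the actual system may handle ties differently.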

8.
9.
Hit lists are at the core of retrieval systems. The top ranks are important, especially if user feedback is used to train the system. Analysis of hit lists revealed counter-intuitive instances in the top ranks for good classifiers. In this study, we propose that two functions need to be optimised: (a) in order to reduce a massive set of instances to a likely subset among ten thousand or more classes, separability is required. However, the results need to be intuitive after ranking, reflecting (b) the prototypicality of instances. By optimising these requirements sequentially, the number of distracting images is strongly reduced, followed by nearest-centroid based instance ranking that retains an intuitive (low-edit distance) ranking. We show that in handwritten word-image retrieval, precision improvements of up to 35 percentage points can be achieved, yielding up to 100% top hit precision and 99% top-7 precision in data sets with 84 000 instances, while maintaining high recall performances. The method is conveniently implemented in a massive scale, continuously trainable retrieval engine, Monk.
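The second stage described here, nearest-centroid ranking of the instances that survive the classifier, can be sketched as below. This is only an illustration under stated assumptions: feature vectors are plain lists of numbers, distance is Euclidean, and the names `centroid` and `rank_by_prototypicality` are hypothetical and not taken from the Monk system.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_by_prototypicality(candidates, class_centroid):
    """Sort (instance_id, vector) pairs by distance to the class centroid.

    Instances closest to the centroid (the most prototypical ones) come first.
    """
    return sorted(candidates, key=lambda item: math.dist(item[1], class_centroid))

# Centroid estimated from labelled examples of one word class (toy data).
proto = centroid([[0.0, 0.0], [2.0, 2.0]])

# Candidates that a separability-oriented classifier already accepted.
hits = [("a", [5.0, 5.0]), ("b", [1.0, 1.0]), ("c", [2.0, 1.0])]
ranked_ids = [instance_id for instance_id, _ in rank_by_prototypicality(hits, proto)]
```

Separating the two steps means the classifier only has to decide membership, while the ordering shown to the user depends on how typical each instance looks.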

10.
We propose a method to perform text searches on handwritten word image databases when no ground-truth data is available to learn models or select example queries. The approach proceeds by synthesizing multiple images of the query string using different computer fonts. While this idea has been successfully applied to printed documents in the past, its application to the handwritten domain is not straightforward. Indeed, the domain mismatch between queries (synthetic) and database images (handwritten) leads to poor accuracy. Our solution is to represent the queries with robust features and use a model that explicitly accounts for the domain mismatch. While the model is trained using synthetic images, its generative process produces samples according to the distribution of handwritten features. Furthermore, we propose an unsupervised method to perform font selection, which has a significant impact on accuracy. Font selection is formulated as finding an optimal weighted mixture of fonts that best approximates the distribution of handwritten low-level features. Experiments demonstrate that the proposed method is an effective way to perform queries without using any human-annotated example in any part of the process.

11.
This paper presents the main current theoretical issues in Information Retrieval. The principles of conceptual modelling, as they have emerged in the database area, are presented, and their application to document modelling in order to enhance document retrieval is discussed. Finally, the main features of the MULTOS project are presented and critically reviewed against the requirements identified during the general discussion on document conceptual modelling for information retrieval.

12.
Ranking functions are an important component of information retrieval systems. Recently there has been a surge of research in the field of “learning to rank”, which aims at using labeled training data and machine learning algorithms to construct reliable ranking functions. Machine learning methods such as neural networks, support vector machines, and least squares have been successfully applied to ranking problems, and some are already being deployed in commercial search engines. Despite these successes, most algorithms to date construct ranking functions in a supervised learning setting, which assumes that relevance labels are provided by human annotators prior to training the ranking function. Such methods may perform poorly when human relevance judgments are not available for a wide range of queries. In this paper, we examine whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms. In particular, we investigate the transductive setting, where the unlabeled data is equivalent to the test data. We propose a simple yet flexible transductive meta-algorithm: the key idea is to adapt the training procedure to each test list after observing the documents that need to be ranked. We investigate two instantiations of this general framework. The feature generation approach is based on discovering more salient features from the unlabeled test data and training a ranker on this test-dependent feature set. The importance weighting approach is based on ideas in the domain adaptation literature, and works by re-weighting the training data to match the statistics of each test list. We demonstrate that both approaches improve over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.

13.
14.
Object-Fuzzy Concept Network (O-FCN) is a recent knowledge representation model to integrate Fuzzy Ontologies in Information Retrieval systems. O-FCNs handle huge data collections and have to face the inherent complexity of semantic manipulation during the retrieval process. Therefore their distribution is an essential requirement to reach good scalability. We present ‘Grid2Peer’: a distributed architecture for O-FCN-based semantic information retrieval that exploits the self-organization characteristics of both Grid and P2P systems. The most relevant features in Grid2Peer are the adoption of the fuzzy sets to organize the overlay itself, the capability of migrating knowledge towards the location where it is accessed, and granting dynamic load balancing among peers. Numerical simulations are performed in order to analyze these characteristics, evaluating also fuzzy precision and fuzzy recall measures given by the distributed retrieval algorithm for the Grid2Peer architecture.

15.
16.
Information retrieval in document image databases   Total citations: 2, self-citations: 0, citations by others: 2
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach capable of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is used to measure the similarity between the two primitive strings generated from two word images. Based on this similarity, we can estimate how relevant one word image is to the other and thereby decide whether one is a portion of the other. To deal with various character fonts, we represent a word image with a primitive string that is tolerant of serif and font differences. Using this inexact string matching technique, our method successfully handles the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of the proposed approach in document image retrieval.
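The idea of deciding whether one primitive string is a portion of another can be illustrated with a standard approximate-substring edit distance. This is a generic dynamic-programming technique swapped in for illustration, not the paper's own matching procedure; the letter codes standing in for primitives and the name `best_partial_match` are hypothetical.

```python
def best_partial_match(pattern, text):
    """Minimum edit distance between pattern and any substring of text.

    Unlike plain edit distance, a match may start and end anywhere in text,
    so a short word can be found inside a longer (e.g. touching) word.
    """
    prev = [0] * (len(text) + 1)  # free start: row 0 costs nothing
    for i, p in enumerate(pattern, start=1):
        curr = [i]  # matching i pattern symbols against nothing costs i
        for j, t in enumerate(text, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete from pattern
                            curr[j - 1] + 1,      # insert into pattern
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return min(prev)  # free end: take the best end position

exact = best_partial_match("abc", "xxabcxx")    # pattern occurs verbatim
one_off = best_partial_match("abc", "xxabdxx")  # one primitive differs
```

A small distance relative to the pattern length suggests the shorter word image is contained in the longer one; any acceptance threshold would need to be tuned on real data.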

17.
Multimedia Tools and Applications - Handwritten document image dataset is one of the basic necessities to conduct research on developing Optical Character Recognition (OCR) systems. In a...

18.
19.
A new technology for intelligent full text document retrieval is presented. The retrieval of a document is treated as an expert system problem, recognizing that human document retrieval is expert behavior. The technology is semantic measurement. A working prototype system, LIBRARY, has been built based on the technology. Input is a request for information, in unrestricted technical English; output is all documents with measured content similar to that of the request, ranked in order of relevance. Retrieval is unaffected by similarity or dissimilarity of terms between request and document. LIBRARY's performance is comparable to that of an expert human librarian, representing a significant improvement over traditional document retrieval systems.

20.
Imaged document text retrieval without OCR   Total citations: 6, self-citations: 0, citations by others: 6
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 (University of Washington 1) database, confirms the validity of the proposed method.
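The n-gram document vector and dot-product similarity can be sketched as follows. This assumes each character object has already been reduced to a discrete code (for example by quantizing its VTD/HTD features, a step not shown here), and that vectors are normalized to unit length before the dot product; both are assumptions, and the names `ngram_vector` and `similarity` are hypothetical.

```python
import math
from collections import Counter

def ngram_vector(codes, n=3):
    """Build a unit-length sparse vector of n-gram counts.

    codes : sequence of character-object codes for one document
    """
    counts = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}

def similarity(vec_a, vec_b):
    """Dot product of two sparse n-gram vectors."""
    return sum(weight * vec_b.get(gram, 0.0) for gram, weight in vec_a.items())

doc_a = ngram_vector([1, 2, 3, 4], n=2)
doc_b = ngram_vector([5, 6, 7], n=2)   # shares no bigram with doc_a
doc_c = ngram_vector([1, 2, 9], n=2)   # shares the bigram (1, 2) with doc_a

same = similarity(doc_a, doc_a)
disjoint = similarity(doc_a, doc_b)
partial = similarity(doc_a, doc_c)
```

Because the vectors are built from shape codes rather than recognized characters, the comparison never requires OCR output; documents with similar text simply share many code n-grams.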
