首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Transforming paper documents into XML format with WISDOM++   总被引:1,自引:1,他引:0  
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000  相似文献   

2.
3.
Both of an automatic classification method for original documents based on image feature and a layout analysis method based on rule hypothesis tree are proposed. Then an intelligent document-filling system by electronizing the original documents, which can be applied to cellphones and pads is designed. When users are filling documents online, information can be automatically input to the financial information system merely by taking photos of the original documents. By this means can not only save time but also ensure the accuracy between the data online and that on the original documents. Experiments show that the accuracy of document classification is 88.38%, the accuracy of document-filling is 87.22%, and it takes 5.042 seconds dealing with per document. The system can be applied to financial, government, libraries, electric power, enterprises and many other industries, which has high economic and application value.  相似文献   

4.
5.
Retrieving information from document images: problems and solutions   总被引:1,自引:0,他引:1  
An information retrieval system that captures both visual and textual contents from paper documents can derive maximal benefits from DAR techniques while demanding little human assistance to achieve its goals. This article discusses technical problems, along with solution methods, and their integration into a well-performing system. The focus of the discussion is very difficult applications, for example, Chinese and Japanese documents. Solution methods are also highlighted, with the emphasis placed upon some new ideas, including window-based binarization using scale measures, document layout analysis for solving the multiple constraint problem, and full-text searching techniques capable of evading machine recognition errors. Received May 25, 2000 / Revised November 7, 2000  相似文献   

6.
Automatic text segmentation and text recognition for video indexing   总被引:13,自引:0,他引:13  
Efficient indexing and retrieval of digital video is an important function of video databases. One powerful index for retrieval is the text appearing in them. It enables content-based browsing. We present our new methods for automatic segmentation of text in digital videos. The algorithms we propose make use of typical characteristics of text in videos in order to enable and enhance segmentation performance. The unique features of our approach are the tracking of characters and words over their complete duration of occurrence in a video and the integration of the multiple bitmaps of a character over time into a single bitmap. The output of the text segmentation step is then directly passed to a standard OCR software package in order to translate the segmented text into ASCII. Also, a straightforward indexing and retrieval scheme is introduced. It is used in the experiments to demonstrate that the proposed text segmentation algorithms together with existing text recognition algorithms are suitable for indexing and retrieval of relevant video sequences in and from a video database. Our experimental results are very encouraging and suggest that these algorithms can be used in video retrieval applications as well as to recognize higher level semantics in videos.  相似文献   

7.
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study where subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps. Received June 15, 2000 / Revised November 15, 2000  相似文献   

8.
In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts. Received February 17, 1998 / Revised April 8, 1998  相似文献   

9.
Detection, segmentation, and classification of specific objects are the key building blocks of a computer vision system for image analysis. This paper presents a unified model-based approach to these three tasks. It is based on using unsupervised learning to find a set of templates specific to the objects being outlined by the user. The templates are formed by averaging the shapes that belong to a particular cluster, and are used to guide a probabilistic search through the space of possible objects. The main difference from previously reported methods is the use of on-line learning, ideal for highly repetitive tasks. This results in faster and more accurate object detection, as system performance improves with continued use. Further, the information gained through clustering and user feedback is used to classify the objects for problems in which shape is relevant to the classification. The effectiveness of the resulting system is demonstrated in two applications: a medical diagnosis task using cytological images, and a vehicle recognition task. Received: 5 November 2000 / Accepted: 29 June 2001 Correspondence to: K.-M. Lee  相似文献   

10.
The most noticeable characteristic of a construction tender document is that its hierarchical architecture is not obviously expressed but is implied in the citing information. Currently available methods cannot deal with such documents. In this paper, the intra-page and inter-page relationships are analyzed in detail. The creation of citing relationships is essential to extracting the logical structure of tender documents. The hierarchy of tender documents naturally leads to extracting and displaying the logical structure as tree structure. This method is successfully implemented in VHTender, and is the key to the efficiency and flexibility of the whole system. Received February 28, 2000 / Revised October 20, 2000  相似文献   

11.
This paper describes an intelligent information system for effectively managing huge amounts of online text documents (such as Web documents) in a hierarchical manner. The organizational capabilities of this system are able to evolve semi-automatically with minimal human input. The system starts with an initial taxonomy in which documents are automatically categorized, and then evolves so as to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that utilize text-mining technologies such as document clustering, document categorization, and hierarchy reorganization. In particular, clustering and categorization algorithms have been intensively studied in order to provide evolving facilities for hierarchical structures and categorization criteria. Through experiments using the Reuters-21578 document collection, we evaluate the performance of the proposed clustering and categorization methods by comparing them to those of well-known conventional methods.  相似文献   

12.
This paper presents the current state of the A2iA CheckReaderTM – a commercial bank check recognition system. The system is designed to process the flow of payment documents associated with the check clearing process: checks themselves, deposit slips, money orders, cash tickets, etc. It processes document images and recognizes document amounts whatever their style and type – cursive, hand- or machine printed – expressed as numerals or as phrases. The system is adapted to read payment documents issued in different English- or French-speaking countries. It is currently in use at more than 100 large sites in five countries and processes daily over 10 million documents. The average read rate at the document level varies from 65 to 85% with a misread rate corresponding to that of a human operator (1%). Received October 13, 2000 / Revised December 4, 2000  相似文献   

13.
Abstract. For some multimedia applications, it has been found that domain objects cannot be represented as feature vectors in a multidimensional space. Instead, pair-wise distances between data objects are the only input. To support content-based retrieval, one approach maps each object to a k-dimensional (k-d) point and tries to preserve the distances among the points. Then, existing spatial access index methods such as the R-trees and KD-trees can support fast searching on the resulting k-d points. However, information loss is inevitable with such an approach since the distances between data objects can only be preserved to a certain extent. Here we investigate the use of a distance-based indexing method. In particular, we apply the vantage point tree (vp-tree) method. There are two important problems for the vp-tree method that warrant further investigation, the n-nearest neighbors search and the updating mechanisms. We study an n-nearest neighbors search algorithm for the vp-tree, which is shown by experiments to scale up well with the size of the dataset and the desired number of nearest neighbors, n. Experiments also show that the searching in the vp-tree is more efficient than that for the -tree and the M-tree. Next, we propose solutions for the update problem for the vp-tree, and show by experiments that the algorithms are efficient and effective. Finally, we investigate the problem of selecting vantage-point, propose a few alternative methods, and study their impact on the number of distance computation. Received June 9, 1998 / Accepted January 31, 2000  相似文献   

14.
With respect to the specific requirements of advanced OODB applications, index data structures for type hierarchies in OODBMS have to provide efficient support for multiattribute queries and have to allow index optimization for a particular query profile. We describe the multikey type index and an efficient implementation of this indexing scheme. It meets both requirements: in addition to its multiattribute query capabilities it is designed as a mediator between two standard design alternatives, key-grouping and type-grouping. A prerequisite for the multikey type index is a linearization algorithm which maps type hierarchies to linearly ordered attribute domains in such a way that each subhierarchy is represented by an interval of this domain. The algorithm extends previous results with respect to multiple inheritance. The subsequent evaluation of our proposal focuses on storage space overhead as well as on the number of disk I/O operations needed for query execution. The analytical results for the multikey type index are compared to previously published figures for well-known single-key search structures. The comparison clearly shows the superiority of the multikey type index for a large class of query profiles. Edited by E. Bertino. Received October 7, 1996 / Accepted March 28, 1997  相似文献   

15.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with high dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). In the study, a comprehensive study of comparing base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis, which evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, two-way ANOVA test is employed. The experimental analysis indicates that Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. For ACM document collection, the highest average predictive performance (93.80%) is obtained with the utilization of the most frequent based keyword extraction method with Bagging ensemble of Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.  相似文献   

16.
Geometric groundtruth at the character, word, and line levels is crucial for designing and evaluating optical character recognition (OCR) algorithms. Kanungo and Haralick proposed a closed-loop methodology for generating geometric groundtruth for rescanned document images. The procedure assumed that the original image and the corresponding groundtruth were available. It automatically registered the original image to the rescanned one using four corner points and then transformed the original groundtruth using the estimated registration transformation. In this paper, we present an attributed branch-and-bound algorithm for establishing the point correspondence that uses all the data points. We group the original feature points into blobs and use corners of blobs for matching. The Euclidean distance between character centroids is used as the error metric. We conducted experiments on synthetic point sets with varying layout complexity to characterize the performance of two matching algorithms. We also report results on experiments conducted using the University of Washington dataset. Finally, we show examples of application of this methodology for generating groundtruth for microfilmed and FAXed versions of the University of Washington dataset documents. Received: July 24, 2001 / Accepted: May 20, 2002  相似文献   

17.
This paper describes aminimally immersive three-dimensional volumetric interactive information visualization system for management and analysis of document corpora. The system, SFA, uses glyph-based volume rendering, enabling more complex data relationships and information attributes to be visualized than traditional 2D and surface-based visualization systems. Two-handed interaction using three-space magnetic trackers and stereoscopic viewing are combined to produce aminimally immersive interactive system that enhances the user’s three-dimensional perception of the information space. This new system capitalizes on the human visual system’s pre-attentive learning capabilities to quickly analyze the displayed information. SFA is integrated with adocument management and information retrieval engine named Telltale. Together, these systems integrate visualization and document analysis technologies to solve the problem of analyzing large document corpora. We describe the usefulness of this system for the analysis and visualization of document similarity within acorpus of textual documents, and present an example exploring authorship of ancient Biblical texts. Received: 15 December 1997 / Revised: June 1999  相似文献   

18.
Automatic character recognition and image understanding of a given paper document are the main objectives of the computer vision field. For these problems, a basic step is to isolate characters and group words from these isolated characters. In this paper, we propose a new method for extracting characters from a mixed text/graphic machine-printed document and an algorithm for distinguishing words from the isolated characters. For extracting characters, we exploit several features (size, elongation, and density) of characters and propose a characteristic value for classification using the run-length frequency of the image component. In the context of word grouping, previous works have largely been concerned with words which are placed on a horizontal or vertical line. Our word grouping algorithm can group words which are on inclined lines, intersecting lines, and even curved lines. To do this, we introduce the 3D neighborhood graph model which is very useful and efficient for character classification and word grouping. In the 3D neighborhood graph model, each connected component of a text image segment is mapped onto 3D space according to the area of the bounding box and positional information from the document. We conducted tests with more than 20 English documents and more than ten oriental documents scanned from books, brochures, and magazines. Experimental results show that more than 95% of words are successfully extracted from general documents, even in very complicated oriental documents. Received August 3, 2001 / Accepted August 8, 2001  相似文献   

19.
The next generation of interactive multimedia documents can contain both static media, e.g., text, graph, image, and continuous media, e.g., audio and video, and can provide user interactions in distributed environments. However, the temporal information of multimedia documents cannot be described using traditional document structures, e.g., Open Document Architecture (ODA) and Standard Generalized Mark-up Language (SGML); the continuous transmission of media units also raises some new synchronization problems, which have not been met before, for processing user interactions. Thus, developing a distributed interactive multimedia document system should resolve the issues of document model, presentation control architecture, and control scheme. In this paper, we (i) propose a new multimedia document model that contains the logical structure, the layout structure, and the temporal structure to formally describe multimedia documents, and (ii) point out main interaction-based synchronization problems, and propose a control architecture and a token-based control scheme to solve these interaction-based synchronization problems. Based on the proposed document model, control architecture, and control scheme, a distributed interactive multimedia document development mechanism, which is called MING-I, is developed on SUN workstations.  相似文献   

20.
Machine Learning for Intelligent Processing of Printed Documents   总被引:1,自引:0,他引:1  
A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号