Similar Documents
20 similar documents found (search time: 31 ms)
1.
Transforming paper documents into XML format with WISDOM++ (cited by: 1; self-citations: 1, citations by others: 0)
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000

2.
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in the layout-based document image retrieval application decreases the retrieval error rates by 30%.

3.
Context: Advances in customization have highlighted the need for tools supporting variable content document management and generation in many domains. Current tools allow the generation of highly customized documents that are variable in both content and layout. However, most frameworks are technology-oriented, and their use requires advanced skills in implementation-related tools, which means their use by end users (i.e. document designers) is severely limited. Objective: Starting from past and current trends in customized document authoring, our goal is to provide a document generation alternative in which variants are specified at a high level of abstraction and content reuse can be maximized in high variability scenarios. Method: Based on our experience in Document Engineering, we identified areas in the variable content document management and generation field open to further improvement. We first classified the primary sources of variability in document composition processes and then developed a methodology, which we called DPL, based on Software Product Lines principles, to support document generation in high variability scenarios. Results: In order to validate the applicability of our methodology we implemented a tool, DPLfw, to carry out DPL processes. After using this in different scenarios, we compared our proposal with other state-of-the-art tools for variable content document management and generation. Conclusion: DPLfw showed a good capacity for the automatic generation of variable content documents, equalling or in some cases surpassing other currently available approaches. To the best of our knowledge, DPLfw is the only framework that combines variable content and document workflow facilities, easing the generation of variable content documents in which multiple actors play different roles.

4.
The acquisition and processing of printed-document information is a heavy but indispensable task in text information processing applications, particularly in the construction of digital libraries. Because the optical character recognition (OCR) systems in wide use today cannot automatically process scanned text images that are skewed, considerable manual intervention is required, i.e., image skew must be corrected by hand. Since batch processing of scanned text collections therefore cannot be carried out effectively, processing efficiency is hard to improve. To address this problem, building on a discussion of the projection properties of text-image contours, we exploit the statistical dependence between the correlation coefficient of these projections and the text skew angle to construct an automatic skew-correction method for text images.
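A minimal Python sketch of projection-profile skew estimation in the spirit of the method described above. It searches a small range of candidate angles and scores each rotation by the variance of the horizontal projection profile; the paper's actual statistic is a correlation coefficient of the contour projections, so the scoring function, angle range, and step size here are simplified, illustrative assumptions.

```python
# Sketch only: score candidate rotations by the sharpness of the horizontal
# projection profile; the original method's correlation-based criterion is
# replaced here by a simple variance score.
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angle_range=5.0, step=0.1):
    """binary_img: 2-D array with 1 = ink, 0 = background. Returns the
    correction angle (in degrees) that best aligns the text lines."""
    best_angle, best_score = 0.0, -np.inf
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)      # horizontal projection profile
        score = np.var(profile)            # text lines vs. gaps -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(binary_img):
    return rotate(binary_img, estimate_skew(binary_img), reshape=False, order=0)
```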

5.
Emerging applications such as personalized portals, enterprise search, and web integration systems often require keyword search over semi-structured views. However, traditional information retrieval techniques are likely to be expensive in this context because they rely on the assumption that the set of documents being searched is materialized. In this paper, we present a system architecture and algorithm that can efficiently evaluate keyword search queries over virtual (unmaterialized) XML views. An interesting aspect of our approach is that it exploits indices present on the base data and thereby avoids materializing large parts of the view that are not relevant to the query results. Another feature of the algorithm is that by solely using indices, we can still score the results of queries over the virtual view, and the resulting scores are the same as if the view was materialized. Our performance evaluation using the INEX data set in the Quark (Bhaskar et al. in Quark: an efficient XQuery full-text implementation. In: SIGMOD, 2006) open-source XML database system indicates that the proposed approach is scalable and efficient.

6.
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent a tremendous computational overload which may be caused in large and diverse document collections such as pages downloaded from the World Wide Web. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, pre-fixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields. The first field refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs but, on the other hand, improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small, pre-fixed number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, by detecting anomalous documents downloaded from the internet by users with abnormal information interests.
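A minimal Python sketch of the incremental clustering behaviour described above: documents are sparse {key phrase: weight} maps, similarity is the cosine measure, and an incoming vector that is not similar enough to any existing centroid opens a new cluster. The similarity threshold and the running-mean centroid update are illustrative assumptions rather than the paper's exact procedures.

```python
# Sketch of threshold-based incremental clustering over variable-size
# key-phrase vectors; the number of clusters is not fixed in advance.
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_stream(docs, threshold=0.3):
    centroids, sizes, assignments = [], [], []
    for doc in docs:                          # each doc: {"key phrase": weight, ...}
        sims = [cosine(doc, c) for c in centroids]
        if sims and max(sims) >= threshold:
            i = sims.index(max(sims))
            sizes[i] += 1
            c = centroids[i]
            for k in set(c) | set(doc):       # running-mean centroid update
                c[k] += (doc.get(k, 0.0) - c[k]) / sizes[i]
        else:                                  # dissimilar vector starts a new cluster
            centroids.append(defaultdict(float, doc))
            sizes.append(1)
            i = len(centroids) - 1
        assignments.append(i)
    return assignments, centroids
```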

7.
Access to legal information and, in particular, to legal literature is examined for the creation of a search and retrieval system for Italian legal literature. The design and implementation of services such as integrated access to a wide range of resources are described, with a particular focus on the importance of exploiting metadata assigned to disparate legal material. The integration of structured repositories and Web documents is the main purpose of the system: it is constructed on the basis of a federation system with service provider functions, aiming at creating a centralized index of legal resources. The index is based on a uniform metadata view created for structured data by means of the OAI approach and for Web documents by a machine learning approach, which, in this paper, has been assessed as regards document classification. Semantic searching is a major requirement for legal literature users; a solution based on the exploitation of Dublin Core metadata, together with legal ontologies and related terms prepared for accessing indexed articles, has been implemented.

8.
This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, so that it can be appropriately rotated and processed by an optical character recognition (OCR) engine. The method analyzes the “open” portions of text blobs to determine the direction in which the open portions face. By determining the respective densities of blobs opening in a pair of opposite directions (e.g., right or left), the method can establish the direction in which the text as a whole is oriented. We first describe a method for determining the up/down orientation of roman text based on the asymmetry in the openness of most roman letters in the horizontal direction. For non-roman text such as Pashto and Hebrew, we provide a method that determines a direction that is the most asymmetric, and therefore the most useful for the determination of text orientation, given a training data set of documents of known orientation. This work can be adapted for use in automated mail processing or to determine the orientation of checks in automated teller machine envelopes, scanned or copied documents, documents sent via facsimile, and digital photographs that include text (e.g., road signs, business cards, driver's licenses), among other applications.

9.
Structure analysis of table form documents is an important issue because a printed document and even an electronic document do not provide logical structural information but merely geometrical layout and lexical information. To handle these documents automatically, logical structure information is necessary. In this paper, we first analyze the elements of form documents from a communication point of view and retrieve the grammatical elements that appear in them. Then, we present a document structure grammar which governs the logical structure of form documents. Finally, we propose a structure analysis system for table form documents based on the grammar. By using grammar notation, we can easily modify the grammar and keep it consistent, as the rules are relatively simple. Another advantage of using grammar notation is that it can be used for generating documents from logical structure alone. In our system, documents are assumed to be composed of a set of boxes, which are classified into seven box types. The relations between an indication box and its associated entry box are then analyzed based on the semantic and geometric knowledge defined in the document structure grammar. Experimental results have shown that the system successfully analyzed several kinds of table forms.

10.
Extraction of sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events. This paper addresses the issue of constructing an optimal information preserving decomposition of the time period associated with a given document set, i.e., a decomposition with the smallest number of subintervals, subject to no loss of information. We introduce the notion of the compressed interval decomposition, where each subinterval consists of consecutive time points having identical information content. We define optimality, and show that any optimal information preserving decomposition of the time period is a refinement of the compressed interval decomposition. We define several special classes of measure functions (functions that measure the prevalence of keywords in the document set and assign them numeric values), based on their effect on the information computed as document sets are combined. We give algorithms, appropriate for different classes of measure functions, for computing an optimal information preserving decomposition of a given document set. We studied the effectiveness of these algorithms by computing several compressed interval and information preserving decompositions for a subset of the Reuters–21578 document set. The experiments support the obvious conclusion that the temporal information gleaned from a document set is strongly dependent on the measure function used and on other user-defined parameters.
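A small Python illustration of the compressed interval decomposition described above: consecutive time points whose keyword measures are identical are merged into one subinterval. The measure function is supplied by the caller; the toy per-day keyword counts in the example are an assumption for demonstration only.

```python
# Sketch: merge consecutive time points with identical information content.
from itertools import groupby

def compressed_intervals(time_points, measure):
    """time_points: ordered list of time points; measure(t) -> {keyword: value}."""
    intervals = []
    for key, group in groupby(time_points, key=lambda t: tuple(sorted(measure(t).items()))):
        pts = list(group)
        intervals.append((pts[0], pts[-1], dict(key)))  # (start, end, shared measures)
    return intervals

# Toy example: keyword counts per day (illustrative data, not from the paper).
docs_by_day = {1: {"merger": 2}, 2: {"merger": 2}, 3: {"strike": 1}}
print(compressed_intervals([1, 2, 3], lambda t: docs_by_day.get(t, {})))
# -> [(1, 2, {'merger': 2}), (3, 3, {'strike': 1})]
```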

11.
Clustering is a very powerful data mining technique for topic discovery from text documents. Partitional clustering algorithms, such as the family of k-means, are reported to perform well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered neighbors of each other, and the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link incorporate global information into the measurement of the closeness of two documents. In this paper, we propose to use neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrate that the proposed methods can significantly improve the accuracy of document clustering without greatly increasing the execution time.
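A short Python sketch of the neighbor and link notions summarized above, together with one possible way of combining cosine and link similarity. The neighbor threshold theta and mixing weight alpha are illustrative assumptions, and the matrix formulation assumes a row-normalized document-term matrix.

```python
# Sketch: neighbors = pairs above a cosine threshold; link(i, j) = number of
# common neighbors; combined similarity mixes cosine and (normalized) link.
import numpy as np

def neighbor_matrix(X, theta=0.3):
    """X: row-normalized document-term matrix (n_docs x n_terms)."""
    S = X @ X.T                       # pairwise cosine similarities
    N = (S >= theta).astype(float)    # 1 if the two documents are neighbors
    np.fill_diagonal(N, 0.0)
    return S, N

def combined_similarity(X, theta=0.3, alpha=0.5):
    S, N = neighbor_matrix(X, theta)
    L = N @ N                         # link(i, j) = number of common neighbors
    L = L / L.max() if L.max() > 0 else L
    return alpha * S + (1.0 - alpha) * L
```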

12.

In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally very lengthy, containing complex sentence structures, and reading them fully is a strenuous task even for legal practitioners. Having a concise overview of these documents can relieve legal practitioners from the task of reading the complete case statements. Legal catchphrases are (multi-word) phrases that provide a concise overview of the contents of a case document, and automated generation of catchphrases is a challenging problem in legal analytics. In this paper, we propose a novel supervised neural sequence tagging model for the extraction of catchphrases from legal case documents. Specifically, we show that incorporating document-specific information along with a sequence tagging model can enhance the performance of catchphrase extraction. We perform experiments over a set of Indian Supreme Court case documents, for which the gold-standard catchphrases (annotated by legal practitioners) are obtained from a popular legal information system. The performance of our proposed method is compared with that of several existing supervised and unsupervised methods, and our proposed method is empirically shown to be superior to all baselines.
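The tagging model itself is not reproduced here; the short Python helper below only illustrates how catchphrases would be read off the token-level labels that such a sequence tagger emits, assuming a BIO labelling scheme (B-CP/I-CP/O), which is an assumption about the output format rather than a detail taken from the paper.

```python
# Sketch: recover multi-word catchphrases from per-token BIO labels.
def decode_catchphrases(tokens, tags):
    """tags use the BIO scheme: B-CP starts a catchphrase, I-CP continues it, O is outside."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CP":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-CP" and current:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(decode_catchphrases(
    ["the", "doctrine", "of", "res", "judicata", "applies"],
    ["O", "O", "O", "B-CP", "I-CP", "O"]))
# -> ['res judicata']
```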


13.
Given a set of low-quality line-delimited tabular documents of the same layout, we present a robust zoning algorithm which exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template, which can then be snapped to a new document to perform automated “cookie cutter” data extraction. We also report a companion consensus-based algorithm for the classification of zone content as either machine print, handwriting or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an error of 0.076 (on a [0, 1) scale). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content. Heath Nielson received his B.S. in 1998 and an M.S. degree in 2003 from Brigham Young University, Provo, Utah, in computer science. He now works at the Church of Jesus Christ of Latter-day Saints on microfilm scanning technology in Salt Lake City, Utah. William Barrett received his Ph.D. (1978) in medical biophysics and computing and his undergraduate degree in mathematics from the University of Utah. He was a research fellow at the National Institutes of Health in the Division of Computer Research and Technology, where he worked with the National Heart, Lung, and Blood Institute. He is also a member of IEEE and ACM and has over 60 refereed publications. He is currently at BYU and heads an active research group that works in the areas of computer vision, pattern recognition, and image processing.

14.
Document Similarity Using a Phrase Indexing Graph Model (cited by: 3; self-citations: 1, citations by others: 2)
Document clustering techniques mostly rely on single-term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
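A hedged Python sketch of blending single-term similarity with matched-phrase similarity, in the spirit of the model described above. Shared word n-grams stand in for phrases matched through the Document Index Graph, and the blending weight alpha is an illustrative assumption.

```python
# Sketch: combine cosine similarity over single terms with cosine similarity
# over shared word n-grams (a stand-in for graph-matched phrases).
import math
from collections import Counter

def ngrams(tokens, n=2):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(c1, c2):
    dot = sum(v * c2.get(k, 0) for k, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def blended_similarity(tokens1, tokens2, alpha=0.7):
    term_sim = cosine(Counter(tokens1), Counter(tokens2))
    phrase_sim = cosine(ngrams(tokens1), ngrams(tokens2))
    return alpha * phrase_sim + (1.0 - alpha) * term_sim
```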

15.
In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map and using a modified suffix trie data structure (called the coloured generalized suffix tree or CGST) to efficiently store and manipulate the feature map. Using the CGST we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms to categorize the data streams, namely, the SVM and Naive Bayes (NB) classifiers. Experiments have provided good classification performance results on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
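A direct (non-suffix-tree) Python sketch of the bounded-length spectrum kernel described above: every byte substring of length 1..k is counted and the kernel is the dot product of the two count maps. The paper's CGST computes the same quantity far more efficiently; this naive version and the window length k are for illustration only.

```python
# Sketch: naive bounded-length byte spectrum and its kernel (dot product).
from collections import Counter

def bounded_spectrum(data: bytes, k: int) -> Counter:
    spec = Counter()
    for length in range(1, k + 1):
        for i in range(len(data) - length + 1):
            spec[data[i:i + length]] += 1
    return spec

def spectrum_kernel(x: bytes, y: bytes, k: int = 4) -> int:
    sx, sy = bounded_spectrum(x, k), bounded_spectrum(y, k)
    return sum(count * sy.get(sub, 0) for sub, count in sx.items())

print(spectrum_kernel(b"<a><b/></a>", b"<a><c/></a>", k=3))
```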

16.
We demonstrate a text-mining method, called the associative Naïve Bayes (ANB) classifier, for automated linking of MEDLINE documents to the Gene Ontology (GO). The approach of this paper is a nontrivial extension of document classification methodology from a fixed set of classes C={c1,c2,…,cn} to a knowledge hierarchy like GO. Due to the complexity of GO, we use a knowledge representation structure. With that structure, we develop the text-mining classifier, called the ANB classifier, which automatically links MEDLINE documents to GO. To check the performance, we compare several well-known classifiers on our datasets: the NB classifier, the large Bayes classifier, the support vector machine and the ANB classifier. Our results, described in the following, indicate its practical usefulness.

17.
18.
19.
In order to process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in text. In order to maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases. Phrases with higher correlations are synthesized into a smaller set of phrases. Finally, the back-propagation network model is adopted as a classifier. The target output identifies a patent document's category based on a hierarchical classification scheme, in this case, the International Patent Classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools. Related patents are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management. The automatic classification module helps the user classify patent documents and the search module helps users find relevant and related patent documents. The result shows an improvement in document classification and identification over previously published methods of patent document management.
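A rough Python sketch of the key-phrase consolidation step outlined above: phrases whose document-occurrence patterns are highly correlated are merged so that the classifier input stays compact. The correlation threshold and the merge rule are illustrative assumptions, and the back-propagation classifier itself is omitted.

```python
# Sketch: merge key phrases whose occurrence patterns across documents are
# highly correlated; constant columns yield NaN correlations and are kept.
import numpy as np

def merge_correlated_phrases(occurrence, phrases, rho=0.9):
    """occurrence: (n_docs x n_phrases) 0/1 matrix; phrases: list of phrase names."""
    corr = np.corrcoef(occurrence, rowvar=False)   # phrase-phrase correlations
    kept, merged_into = [], {}
    for j, phrase in enumerate(phrases):
        match = next((k for k in kept if corr[j, k] >= rho), None)
        if match is None:
            kept.append(j)                          # phrase becomes a representative
        else:
            merged_into[phrase] = phrases[match]    # phrase folded into a kept one
    return [phrases[k] for k in kept], merged_into
```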

20.
In this paper a system for the analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples, and in the application of reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way. Received March 30, 2000 / Revised June 26, 2001
