Similar Documents
20 similar documents found (search time: 46 ms)
1.
As document sharing over the World Wide Web continues to grow, so does the need to convert paper documents into hyperdocuments, in formats such as HTML and SGML/XML, so that they are accessible and retrievable via the Internet. Nevertheless, little work has been done on this conversion, and most of it has concentrated on directly converting single-column document images containing only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table-of-contents page based on logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that the proposed methods generate HTML documents with the same visual layout as the source images, and produce a structured table-of-contents page whose hierarchically ordered section titles are hyperlinked to the contents.
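As a loose illustration of the final step, the sketch below (with an illustrative `Section` type and anchor scheme that are not from the paper) renders hierarchically ordered section titles as a nested, hyperlinked HTML table of contents:

```python
# Hypothetical sketch: render a hierarchy of recognized section titles
# as a nested, hyperlinked HTML table of contents. The Section type and
# anchor scheme are illustrative; the paper's pipeline derives the
# hierarchy from logical structure analysis of the document image.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Section:
    title: str
    anchor: str                      # id of the section in the generated HTML
    children: List["Section"] = field(default_factory=list)

def render_toc(sections: List[Section]) -> str:
    if not sections:
        return ""
    items = "".join(
        f'<li><a href="#{s.anchor}">{s.title}</a>{render_toc(s.children)}</li>'
        for s in sections
    )
    return f"<ul>{items}</ul>"

toc = [Section("1 Introduction", "sec1"),
       Section("2 Method", "sec2", [Section("2.1 Layout analysis", "sec2-1")])]
print(render_toc(toc))
```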

2.
The most noticeable characteristic of a construction tender document is that its hierarchical architecture is not expressed explicitly but is implied in its citing information, which currently available methods cannot handle. In this paper, the intra-page and inter-page relationships are analyzed in detail; establishing the citing relationships is essential to extracting the logical structure of tender documents. The hierarchy of tender documents naturally leads to extracting and displaying the logical structure as a tree. This method has been successfully implemented in VHTender and is the key to the efficiency and flexibility of the whole system. Received February 28, 2000 / Revised October 20, 2000
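A minimal sketch of the core idea, under the assumption that the citing information has already been reduced to (parent, child) pairs; the representation below is illustrative, not VHTender's:

```python
# Hypothetical sketch: recover a document's logical tree from citing
# relationships extracted across pages. Each (parent, child) pair means
# "section `parent` cites / introduces section `child`".
from collections import defaultdict

def build_tree(citations):
    children = defaultdict(list)
    cited = set()
    for parent, child in citations:
        children[parent].append(child)
        cited.add(child)
    roots = [p for p in children if p not in cited]
    def subtree(node):
        return {node: [subtree(c) for c in children.get(node, [])]}
    return [subtree(r) for r in roots]

print(build_tree([("bid", "technical part"), ("bid", "commercial part"),
                  ("technical part", "drawings")]))
```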

3.
Successful applications of digital libraries require structured access to sources of information. This paper presents an approach to extracting the logical structure of text documents. The extracted structure is explicated by means of SGML (Standard Generalized Markup Language). The extraction is performed on the basis of grammars that extend SGML with recognition rules. From these grammars, parsing automata are generated. These automata are used to partition a flat text document into its elements, to discard formatting information, and to insert SGML markup. Complex document structures, together with the fallback rules needed for error-tolerant parsing, make such automata highly ambiguous. A novel parsing strategy has therefore been developed that ranks and prunes ambiguous parsing paths.
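The ranking-and-pruning step can be pictured as a beam search over partial parses. The sketch below is a hypothetical illustration with made-up scores; the paper's automata and ranking criteria are not reproduced:

```python
# Hypothetical sketch of ranking and pruning ambiguous parse paths:
# keep only the best-scoring partial parses at each step (a beam search).
# Scores and the path representation are illustrative.
import heapq

def prune_paths(paths, beam_width=3):
    """paths: list of (score, partial_parse); higher score = better."""
    return heapq.nlargest(beam_width, paths, key=lambda p: p[0])

paths = [(0.9, ["title"]), (0.4, ["para"]), (0.7, ["title", "para"]),
         (0.1, ["fallback"])]
for score, parse in prune_paths(paths):
    print(score, parse)
```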

4.
The most important part of document image understanding is the extraction of logical structure. Current research concentrates mainly on page layout analysis; the few studies of document logical structure address only single-page documents or multi-page documents with simple inter-page relationships. Construction tender documents are special in that their hierarchical logical structure is not marked by explicit index information. This paper proposes a method that obtains the logical structure of a document from the citing relationships between pages. The method represents the logical structure as a modified tree; building the logical tree is itself the process of acquiring the logical structure, and the tree also facilitates higher-level semantic processing and faithful reproduction on output. The method has been implemented in an automatic tender-processing system, where it ensures the system's flexibility and efficiency.

5.
A Rule-Based Algorithm for Extracting the Logical Structure of Books   Cited: 1 (self: 0, others: 1)
An urgent problem in digital library construction is how to automatically digitize massive numbers of paper books into electronic documents. When generating an electronic edition of a book, layout information and logical information are just as important as the content itself. This paper proposes a rule-based algorithm for extracting the logical structure of books. Starting from a model of multi-page book documents, rule-based reasoning is used to extract the logical elements of a book and to determine the hierarchical relationships and connections among them, yielding the logical structure of the whole book. Experimental results demonstrate the effectiveness of the algorithm.

6.
Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents' content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
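One stage of such a pipeline, block categorization, might look like the following hypothetical sketch; the thresholds and `TextBlock` fields are assumptions for illustration, not the paper's unsupervised, per-document model:

```python
# Hypothetical sketch: classify text blocks into coarse classes using
# simple font/geometry heuristics. Thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font_size: float
    y: float            # vertical position on the page (top = 0)
    page_height: float

def classify_block(b: TextBlock, body_size: float) -> str:
    # Blocks hugging the page edges are likely repeated headers/footers.
    if b.y < 0.05 * b.page_height or b.y > 0.95 * b.page_height:
        return "header_footer"
    # Markedly larger fonts suggest headings; smaller ones captions/footnotes.
    if b.font_size >= 1.2 * body_size:
        return "heading"
    if b.font_size <= 0.8 * body_size:
        return "caption_or_footnote"
    return "body"

block = TextBlock("1 Introduction", font_size=14.0, y=120.0, page_height=800.0)
print(classify_block(block, body_size=10.0))   # -> heading
```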

7.
Automatic document processing: A survey   Cited: 8 (self: 0, others: 8)

8.
Yaron Wolfsthal, Software, 1991, 21(6): 625-638
A critical problem in the design of editors for structured documents is style control, i.e., mapping the logical elements of the documents to their physical appearance on pages. This paper presents a novel approach to style control, used in the Quill document editing system prototyped at the IBM Almaden Research Center. In our approach, the style control mechanism is an integral part of the editing system and is consistent with the overall system architecture, in both its inner structure and its user interface. Properties that specify the formatting process, together with action routines for specifying complex semantics, are the basic style control primitives in the proposed approach.

9.
Retrieval of document fragments has great potential for application in engineering information management. Engineers frequently have neither the time nor the inclination to sift through long documents for small pieces of useful information, yet the information they seek is frequently presented in one or more long documents. Supporting the delivery of the right information, in the right format and in the right quantity, motivates the search for better ways of handling document sub-components, or fragments. Document fragment retrieval can be facilitated using modern computational technologies. This paper proposes a novel framework for information access utilising state-of-the-art computational technologies and introducing the use of multiple document structure views through decomposition schemes. The framework integrates document structure study, mark-up technologies, automated fragment extraction, faceted classification, and a document navigation mechanism to achieve the retrieval of specific document fragments using precise, complex queries. These disparate elements have been brought together in an exploratory Engineering Document Content Management System (EDCMS). Investigations using representative engineering documents have shown that users can access and retrieve document content at fragment level rather than at document level, through both in-document data and document metadata, through different perspectives and at different granularities, and simultaneously across multiple documents as well as within a single document.

10.
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic: they require users to provide metadata and extract the data interactively. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm extracts numerical data from the data points and lines in the 2-D plot images, along with the axes and their labels and the data symbols in the figure's legend and their associated labels. We demonstrate via an empirical evaluation that the proposed system and its component algorithms are effective. Our data extraction system has the potential to be a vital component in high-volume digital libraries.
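The figure-classification step could be prototyped as follows; scikit-learn and the plain random feature vectors are stand-ins chosen for brevity, not the paper's actual algorithm or features:

```python
# Hypothetical sketch of supervised figure classification into the five
# classes named in the abstract. Random vectors stand in for real image
# descriptors; any classifier trained on labeled figures would do.
import numpy as np
from sklearn.svm import SVC

CLASSES = ["photograph", "2d_plot", "3d_plot", "diagram", "other"]

rng = np.random.default_rng(0)
X_train = rng.random((100, 64))              # stand-in feature vectors
y_train = rng.integers(0, len(CLASSES), 100)  # stand-in labels

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(CLASSES[int(clf.predict(rng.random((1, 64)))[0])])
```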

11.
This paper presents a predicate-driven document filing system for organizing and automatically filing documents. The document model consists of two basic elements: frame templates representing document classes, and folders, which are repositories of frame instances. The frame templates can be organized into a document type hierarchy, which helps classify and file documents. Frame instances are grouped into a folder on the basis of user-defined criteria, specified as predicates that determine whether a frame instance belongs to the folder. Folders can naturally be organized into a folder organization that represents the user's real-world document filing system. The predicate consistency problem is discussed in order to eliminate two abnormalities from a folder organization: inapplicable edges (filing paths) and redundant folders. An evaluating net (comprising an association dictionary, an instantiation component, and a production system) is then proposed for evaluating whether a frame instance satisfies the predicate of a folder during document filing. The concept of consistency of a rule base is also discussed. This work was supported by the Separately Budgeted Research (SBR) grant (No. 421190) from the New Jersey Institute of Technology and the Systems Integration Program grant from the AT&T Foundation.
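A minimal sketch of predicate-driven filing, with frame instances modeled as plain dictionaries and illustrative field names that are not from the paper:

```python
# Hypothetical sketch: each folder holds a predicate over frame
# instances (plain dicts here); a document is filed into every folder
# whose predicate it satisfies.
folders = {
    "invoices_2024": lambda d: d["type"] == "invoice" and d["year"] == 2024,
    "all_reports":   lambda d: d["type"] == "report",
}

def file_document(doc, folders):
    return [name for name, pred in folders.items() if pred(doc)]

doc = {"type": "invoice", "year": 2024, "amount": 1200}
print(file_document(doc, folders))      # -> ['invoices_2024']
```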

12.
In this paper, a system for the analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest simply by presenting visual examples, and in applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to adapt dynamically to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents; several classes of documents are involved in each. The indexing strategy first classifies the document automatically, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, the document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way. Received March 30, 2000 / Revised June 26, 2001

13.
Focusing on the problems publishers face when releasing digitized content across terminals, this paper studies the extraction of layout information from PDF documents and its cross-terminal adaptive recomposition. It proposes methods for extracting text, images, and other information from PDF documents and for analyzing layout structure, and it uses a terminal-adaptive recomposition algorithm to publish digitized content across terminals. On this basis, a cross-terminal digital content publishing system was designed and applied in publishers' day-to-day work; experimental results demonstrate the feasibility of the approach.

14.
Because knowledge in emergency documents is hard to locate and use, emergency decision-makers cannot formulate effective decisions quickly. From the perspective of knowledge systems engineering, and drawing on knowledge element theory, this paper models emergency document knowledge structurally, providing decision-makers with a new way to use such knowledge quickly and effectively. Metadata are extracted and documents are structured through physical structure analysis; a knowledge element extraction method is proposed through logical structure analysis; knowledge element navigation links associate the extracted knowledge with the structured documents to support knowledge reasoning and retrieval; and fine-grained knowledge mining patterns for emergency documents are explored in depth. Finally, a prototype emergency decision knowledge support system was developed and validated; the results show that the modeling method effectively solves the problem of inefficient search and use of emergency document knowledge.

15.
A Semi-Structured Document Model for Text Mining   Cited: 7 (self: 0, others: 7)
A semi-structured document has more structured information than an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in semi-structured documents for better mining, a structured link vector model (SLVM) is presented in this paper, in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described in terms of the K-means procedure: calculating document similarity and calculating cluster centers. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65-0.73 to 0.82-0.86.
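The SLVM construction can be sketched as a weighted combination of a document's own term vector, vectors of its structural elements, and vectors of its linked neighbors; the weights and combination rule below are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch of an SLVM-style document vector: combine the
# document's term vector with structure-element vectors (e.g., title,
# sections) and linked neighbors' vectors. Weights are illustrative.
import numpy as np

def slvm_vector(term_vec, element_vecs, neighbor_vecs,
                w_self=0.6, w_struct=0.25, w_link=0.15):
    struct = np.mean(element_vecs, axis=0) if len(element_vecs) else 0.0
    link = np.mean(neighbor_vecs, axis=0) if len(neighbor_vecs) else 0.0
    return w_self * term_vec + w_struct * struct + w_link * link

doc = np.array([1.0, 0.0, 2.0])
elems = [np.array([0.5, 1.0, 0.0])]     # e.g., title/section term vectors
links = [np.array([1.0, 1.0, 1.0])]     # vectors of linked documents
print(slvm_vector(doc, elems, links))
```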

16.
This paper describes a knowledge-based system for classifying documents based upon the layout structure and conceptual information extracted from their contents. The spatial elements in a document are laid out in rectangular blocks, which are represented by nodes in an ordered labeled tree called the "Layout Structure Tree" (L-S Tree). Each leaf node of an L-S Tree points to its corresponding block content. A Knowledge Acquisition Tool (KAT) performs inductive learning from the L-S Trees of document samples and then generates the Document Sample Tree and Document Type Tree bases. A testing document is classified if a Document Type Tree is discovered as a substructure of its L-S Tree. We then match the L-S Tree with the Document Sample Trees of the classified document type to find the format of the testing document. The Document Sample Trees and Document Type Trees together form the Structural Knowledge Base (SKB). The tree discovering and matching processes compare the SKB trees and a testing document's L-S Tree using pattern matching and discovery toolkits. Our experimental results demonstrate that many office documents can be classified correctly using the proposed approach.
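The classification test can be sketched as checking whether a Document Type Tree occurs as a substructure of a document's L-S Tree; the tuple representation and the permissive descendant matching below are simplifications of the paper's ordered labeled trees:

```python
# Hypothetical sketch: does type_tree occur as a substructure of
# ls_tree? Trees are (label, children) tuples; matching here allows a
# type child to be found anywhere under the matched node.
def contains(ls_tree, type_tree):
    label, children = ls_tree
    t_label, t_children = type_tree
    if label == t_label and all(
            any(contains(c, tc) for c in children) for tc in t_children):
        return True
    return any(contains(c, type_tree) for c in children)

letter = ("page", [("header", [("logo", []), ("date", [])]),
                   ("body", []), ("signature", [])])
business_letter_type = ("page", [("header", [("date", [])]), ("signature", [])])
print(contains(letter, business_letter_type))   # -> True
```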

17.
Topic models can project documents into a topic space, which facilitates effective document clustering. Selecting a good topic model and improving clustering performance are two highly correlated problems for topic-based document clustering. In this paper, we propose a three-phase approach to topic-based document clustering. In the first phase, we determine the best topic model: we present a formal concept of the significance degree of topics and some topic selection criteria, through which we can find the best number of the most suitable topics from the original topic model discovered by LDA. Then, we choose the initial clustering centers using the k-means++ algorithm. In the third phase, we take the obtained initial clustering centers and use the k-means algorithm for document clustering. Three clustering solutions based on this three-phase approach are used for document clustering, and experiments with the three solutions compare them and illustrate the effectiveness and efficiency of our approach.
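A compressed sketch of the three-phase flow using scikit-learn (the paper's topic-significance criteria and topic selection step are omitted): LDA projects documents into topic space, then k-means with k-means++ initialization clusters the topic vectors:

```python
# Hypothetical sketch: LDA topic projection followed by k-means
# clustering with k-means++ initialization. Corpus and parameters
# are illustrative.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = ["layout analysis of scanned pages",
        "topic models for document clustering",
        "clustering documents with k-means",
        "page segmentation and OCR of scans"]

X = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                random_state=0).fit_predict(topics)
print(labels)
```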

18.
In order to process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in the text. In order to maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases, and phrases with higher correlations are synthesized into a smaller set. Finally, a back-propagation network model is adopted as the classifier. The target output identifies a patent document's category based on a hierarchical classification scheme, in this case the International Patent Classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools; related patents are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management: the automatic classification module helps the user classify patent documents, and the search module helps users find relevant and related patent documents. The results show an improvement in document classification and identification over previously published methods of patent document management.
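The final classification stage could be approximated as below, with scikit-learn's MLPClassifier standing in for the paper's back-propagation network and random data standing in for key-phrase feature vectors:

```python
# Hypothetical sketch: a feed-forward network trained by
# back-propagation classifies key-phrase frequency vectors into
# IPC-style categories. Data, sizes, and labels are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 20))                  # key-phrase feature vectors
y = rng.integers(0, 3, 60)                # 3 stand-in IPC categories

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                    random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```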

19.
This paper proposes both an automatic classification method for original documents based on image features and a layout analysis method based on a rule hypothesis tree. An intelligent document-filling system that electronizes the original documents, applicable to cellphones and tablets, is then designed. When users fill in documents online, information can be entered into the financial information system automatically merely by photographing the original documents. This not only saves time but also ensures consistency between the data online and the data on the original documents. Experiments show that the accuracy of document classification is 88.38%, the accuracy of document filling is 87.22%, and processing takes 5.042 seconds per document. The system can be applied in finance, government, libraries, electric power, enterprises, and many other industries, and has high economic and application value.

20.
Knowledge-based systems for document analysis and understanding (DAU) are quite useful whenever analysis has to deal with changing free-form document types that require different analysis components. In this case, declarative modeling is a good way to achieve flexibility. An important application domain for such systems is the business letter domain, where high accuracy and correct assignment to the right people and the right processes is a crucial success factor. Our solution proposes a comprehensive knowledge-centered approach: we model not only the comparatively static knowledge concerning document properties and analysis results within the same declarative formalism, but also the analysis task and the current context of the system environment. This allows an easy definition of new analysis tasks and also an efficient and accurate analysis, by using expectations about incoming documents as context information. The approach has been implemented within the VOPR system (VOPR is an acronym for the Virtual Office PRototype). This DAU system gains the required context information from a commercial workflow management system (WfMS) through constant exchanges of expectations and analysis tasks. Further interaction between the two systems covers the delivery of results from DAU to the WfMS and the delivery of corrected results vice versa. Received June 19, 1999 / Revised November 8, 2000
