Similar literature
1.
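This paper describes the design and implementation of VHTender, a processing system for English-language tender documents. The system extracts document information from scanned images of paper tenders and converts them into electronic tenders. From the standpoint of functional implementation, the paper describes the methods and strategies the system adopts for several key technologies.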

2.
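To better manage EAST project and experiment documents, and so support smooth experimental operation and cooperation among departments, a document management system with a B/S (browser/server) architecture was designed specifically for EAST. The system uses a UID as the unique identifier of each document, so a document can be retrieved and located directly by its UID. The web front end uses Ajax and jQuery EasyUI to update pages partially. Built on an open-source PHP framework, the system implements classified document storage, online document operations, document version control, document permission management, and workflow. The system has been deployed successfully in EAST experiments and has greatly improved the working efficiency of the departments.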

3.
A rule-based algorithm for extracting the logical structure of books   Cited by: 1 (self-citations: 0, citations by others: 1)
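In building digital libraries, a pressing problem is how to automatically digitize massive numbers of paper books into electronic documents. When generating an electronic book, the layout information and logical information of the document are as important as its content. This paper proposes a rule-based algorithm for extracting the logical structure of books. Starting from a model description of multi-page book documents, it uses rule-based reasoning to extract the logical elements of a book and to determine the hierarchical relationships and interconnections among them, yielding the logical structure of the whole book. Experimental results demonstrate the effectiveness of the algorithm.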
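
The abstract does not list the rules it uses, so the following is only a minimal sketch of the general idea, assuming two hypothetical heading rules ("Chapter N" and "N.N") and a stack that tracks the current nesting level. The names `RULES`, `classify`, and `build_structure` are illustrative and do not come from the paper.

```python
import re

# Hypothetical rules: (pattern, logical label, nesting level).
RULES = [
    (re.compile(r"^Chapter \d+\b"), "chapter", 1),
    (re.compile(r"^\d+\.\d+\s"), "section", 2),
]
PARAGRAPH_LEVEL = 3  # anything that matches no rule is body text


def classify(line):
    """Return (label, level) for one text line using the first matching rule."""
    for pattern, label, level in RULES:
        if pattern.match(line):
            return label, level
    return "paragraph", PARAGRAPH_LEVEL


def build_structure(lines):
    """Arrange classified lines into a tree of nested dictionaries."""
    root = {"label": "book", "level": 0, "text": "", "children": []}
    stack = [root]  # path from the root to the most recently added node
    for line in lines:
        label, level = classify(line)
        node = {"label": label, "level": level, "text": line, "children": []}
        # Climb back up until the top of the stack can be this node's parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


book = build_structure([
    "Chapter 1 Introduction",
    "1.1 Background",
    "Some body text.",
    "More body text.",
    "Chapter 2 Methods",
    "Text directly under the chapter.",
])
```

Here `book["children"]` holds the two chapters, and the two body lines end up as siblings under section 1.1. A real system would presumably also use layout cues such as font size and position, which this sketch ignores.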

4.
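A document type definition (DTD) is a formalized description of the features shared by the logical structures of a class of documents. A structure that describes the hierarchy of a document's content is a concrete instance of a DTD and is constrained by it. Using a fast locating method to position document structure nodes within the DTD, this paper proposes a DTD-constrained document structure generation algorithm, which can provide an efficient real-time constraint mechanism and a stricter validation mechanism for structure-based document processing.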

5.
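Recognizing the structure of flow documents (流式文档) with machine learning suffers from scarce corpora and complex corpus annotation. To address this, the paper studies the logical structure and editing-semantic features of documents, establishes an annotation scheme for the logical structure of flow documents, and proposes a three-stage semi-automatic annotation method: the first stage performs separated annotation of document metadata through computer-assisted manual work, the second stage automatically rebuilds the logical structure, and the third stage automatically fills in the feature vectors. Experimental results show that the proposed annotation method saves manual labor and improves the precision and recall of machine learning algorithms for document structure recognition, with an F-value of 97.5%.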
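
The abstract reports an F-value without defining it. As a minimal sketch, assuming the conventional F1 (the harmonic mean of precision and recall) computed per logical label, the metric could be calculated as follows. The function name `prf` and the label names are illustrative, not taken from the paper.

```python
def prf(predicted, gold, label):
    """Precision, recall and F1 for one label over aligned label sequences."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == label and g == label)
    fp = sum(1 for p, g in pairs if p == label and g != label)
    fn = sum(1 for p, g in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for four document units.
gold = ["title", "paragraph", "paragraph", "heading"]
pred = ["title", "paragraph", "heading", "heading"]
precision, recall, f1 = prf(pred, gold, "heading")
```

In this toy example one of the two predicted headings is correct and the single true heading is found, so precision is 0.5 and recall is 1.0.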

6.
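When editing long documents in Word, such as theses, books, or tender documents, you may often run into problems like these: inserting a landscape page for a figure or table into a portrait-oriented document; numbering the pages of one passage separately; setting up headers and footers chapter by chapter. The best way to solve these problems is to use section breaks in the document.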

7.
Shao Liuguo, Gao Yang. Computer Engineering (《计算机工程》), 2004, 30(20): 104-106
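The structural similarity between Domino and XML allows Domino to store XML documents conveniently and efficiently. XML documents can be stored in Domino forms, fields, pages, or the file system, and XML data that does not need to be stored can be processed in system memory. This article describes how to store XML documents in Domino through the DOM.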

8.
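Table-form documents are used very widely in daily life, in areas such as censuses, bank bills, and reports of all kinds, so automatic computer processing of such documents has real practical value. A table-form document information processing system consists mainly of original image acquisition, document structure extraction, and recognition of the filled-in information. After analyzing the strengths and weaknesses of automatic table-form data entry systems in China and abroad, the authors use a contact image sensor (CIS) to capture the original image signal of the table-form document, obtaining a high-quality image signal in hardware. Optical character recognition (OCR) is used to recognize the filled-in information. The system places low demands on the paper and on how the forms are filled in, and achieves high recognition accuracy.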

9.
A knowledge-pattern-based method for constructing document descriptions   Cited by: 1 (self-citations: 0, citations by others: 1)
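Because traditional document analysis methods cannot effectively obtain knowledge descriptions of weakly structured documents, a knowledge-pattern-based method for constructing document descriptions is proposed. The method considers both the writing patterns of knowledge and contextual structural features, and can therefore obtain knowledge descriptions of weakly structured documents more effectively than traditional methods.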

10.
Zhang Zhen, Li Ning, Tian Ying'ai. Computer Engineering (《计算机工程》), 2020, 46(1): 60-66, 73
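Structure recognition of flow documents plays an important role in automatic optimization of typesetting formats and in information extraction. Rule-based structure recognition methods generalize poorly, while machine-learning-based methods ignore long-distance dependencies between document units and have low recognition accuracy. To address this, a flow document structure recognition method based on a bidirectional long short-term memory (LSTM) network is proposed. Key features are selected from three aspects of document units: format, content, and semantics. Document structure recognition is treated as a sequence labeling problem, and a bidirectional LSTM neural network is used to build a recognition model that recognizes 18 kinds of logical labels. Experimental results show that the method recognizes document structure effectively and outperforms the Founder FeiXiang (方正飞翔) software.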

11.
The most noticeable characteristic of a construction tender document is that its hierarchical architecture is not obviously expressed but is implied in the citing information. Currently available methods cannot deal with such documents. In this paper, the intra-page and inter-page relationships are analyzed in detail. The creation of citing relationships is essential to extracting the logical structure of tender documents. The hierarchy of tender documents naturally leads to extracting and displaying the logical structure as tree structure. This method is successfully implemented in VHTender, and is the key to the efficiency and flexibility of the whole system. Received February 28, 2000 / Revised October 20, 2000

12.
Structure analysis of table form documents is an important issue because a printed document, and even an electronic document, does not provide logical structural information but merely geometrical layout and lexical information. To handle these documents automatically, logical structure information is necessary. In this paper, we first analyze the elements of the form documents from a communication point of view and retrieve the grammatical elements that appear in them. Then, we present a document structure grammar which governs the logical structure of the form documents. Finally, we propose a structure analysis system for table form documents based on the grammar. Because the rules are relatively simple, the grammar notation is easy to modify and to keep consistent. Another advantage of grammar notation is that it can be used to generate documents from logical structure alone. In our system, documents are assumed to be composed of a set of boxes, which are classified into seven box types. The relations between each indication box and its associated entry box are then analyzed based on the semantic and geometric knowledge defined in the document structure grammar. Experimental results have shown that the system successfully analyzed several kinds of table forms.

13.
As document sharing through the World Wide Web has increased steadily, the need to create hyperdocuments that are accessible and retrievable via the internet, in formats such as HTML and SGML/XML, has also risen rapidly. Nevertheless, only a few studies have addressed the conversion of paper documents into hyperdocuments. Moreover, most of these studies have concentrated on the direct conversion of single-column document images that include only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on the logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that, by using the proposed methods, the corresponding HTML documents can be generated in the same visual layout as the document images, and a structured table of contents page can also be produced with the hierarchically ordered section titles hyperlinked to the contents.

14.
XML file conversion based on a document tree   Cited by: 1 (self-citations: 0, citations by others: 1)
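With the continued development of the Internet and XML technology, demand is growing for conversion between XML files and unstructured text files. To address this, the paper proposes an XML file conversion method based on a document tree. The method describes the structure and content of a text file as a document tree and traverses the tree under specific mapping rules to convert between XML files and text files, with RTF files as the representative case. Finally, the paper describes the construction of the document tree and the related algorithms.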
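
The abstract gives neither the tree definition nor the mapping rules, so the following is a minimal sketch of the general idea only: a small in-memory document tree is traversed recursively and each node is mapped to an XML element through a lookup table. `DocNode`, `TAG_MAP`, `to_xml`, and `from_xml` are hypothetical names, and RTF parsing is left out entirely.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocNode:
    kind: str                      # e.g. "document", "section", "paragraph"
    text: str = ""
    children: List["DocNode"] = field(default_factory=list)


# Hypothetical mapping rules: document-tree node kind -> XML tag name.
TAG_MAP = {"document": "doc", "section": "sec", "paragraph": "p"}
KIND_MAP = {tag: kind for kind, tag in TAG_MAP.items()}


def to_xml(node: DocNode) -> ET.Element:
    """Depth-first traversal that emits one XML element per tree node."""
    elem = ET.Element(TAG_MAP[node.kind])
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


def from_xml(elem: ET.Element) -> DocNode:
    """Inverse direction: rebuild the document tree from XML elements."""
    return DocNode(KIND_MAP[elem.tag], elem.text or "",
                   [from_xml(child) for child in elem])


tree = DocNode("document", "", [
    DocNode("section", "Intro", [DocNode("paragraph", "Hello")]),
])
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

Keeping the mapping in a table rather than in code is one way to make the conversion rules swappable per document type, which is an assumption about the design rather than something the abstract states.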

15.
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents.

16.
A Web information extraction method based on the web page structure tree   Cited by: 10 (self-citations: 1, citations by others: 9)
Chen Qiong, Su Wenjian. Computer Engineering (《计算机工程》), 2005, 31(20): 54-55, 140
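A web page structure tree extraction algorithm and a Web information extraction method based on that tree are proposed. During extraction, the to-be-extracted information recorded in the pattern library is located in the web page structure tree and matched against the page content held by the tree's leaf nodes. Extracting information from a web page is thus reduced to searching the leaf nodes of its structure tree. Experiments show that the method has strong web information extraction capability.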
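
A minimal sketch of this leaf-node lookup, under stated assumptions: the structure tree is built with Python's standard `html.parser`, the input HTML closes every tag explicitly (no void elements such as a bare `<br>`), and the "pattern library" is simplified to a dictionary that maps a field name to the tag path of a leaf. None of these choices come from the paper.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.text, self.children = "", []


class TreeBuilder(HTMLParser):
    """Builds a simple structure tree from HTML with explicitly closed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def leaves(node, path=()):
    """Yield (tag path, text) for every leaf of the structure tree."""
    path = path + (node.tag,)
    if not node.children:
        yield "/".join(path), node.text
    for child in node.children:
        yield from leaves(child, path)


def extract(html, patterns):
    """Match each pattern (field -> leaf path) against the tree's leaves."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = {}
    for path, text in leaves(builder.root):
        for name, wanted in patterns.items():
            if path == wanted and name not in found:
                found[name] = text
    return found


page = "<html><body><h1>Title A</h1><div><span>42</span></div></body></html>"
patterns = {"title": "root/html/body/h1", "price": "root/html/body/div/span"}
result = extract(page, patterns)
```

Only the first leaf on a given path is kept per field; repeated records, such as rows of a listing, would need the patterns to carry positional information as well.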

17.
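Using indexing, a double index structure is built over the input XML document to improve the YFilter algorithm and optimize XML document filtering performance. With the index structure, the algorithm looks ahead at the structural information of element nodes in the document and excludes in advance the element nodes that cannot produce any match, avoiding a large amount of unnecessary query processing. Experimental results show that the algorithm achieves good filtering performance when the input XML document is large.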
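
The abstract does not describe the two indexes, and the sketch below does not implement YFilter or its automaton. It only illustrates index-based pre-filtering under simple assumptions: queries are absolute child-axis paths such as `/a/b/c`, one index holds the set of tag names in the document, and a second holds the set of root-to-node paths. All names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_indexes(xml_text):
    """Return (set of tag names, set of root-to-node paths) for a document."""
    tags, paths = set(), set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        tags.add(elem.tag)
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return tags, paths


def filter_queries(xml_text, queries):
    """Split queries into (matched, pruned) using the two indexes."""
    tags, paths = build_indexes(xml_text)
    matched, pruned = [], []
    for query in queries:
        steps = [step for step in query.split("/") if step]
        # Cheap look-ahead: a query that names a tag absent from the
        # document cannot match, so it is dropped before path matching.
        if any(step not in tags for step in steps):
            pruned.append(query)
        elif query in paths:
            matched.append(query)
    return matched, pruned


doc = "<a><b><c/></b><d/></a>"
matched, pruned = filter_queries(doc, ["/a/b/c", "/a/d", "/a/x", "/a/c"])
```

Here `/a/x` is pruned by the tag index alone, while `/a/c` passes the tag check but fails the path lookup, so it appears in neither list.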

18.
In this paper, we describe experimental methods of recognizing the document structures of various types of documents in the framework of document understanding. Namely, we interpret document structures with individually characterized document knowledge. The document understanding process is divided into three procedures: the first is the recognition of document structures from a two-dimensional point of view; the second is the recognition of item relationships from a one-dimensional point of view; and the third is the recognition of characters from a zero-dimensional point of view. The procedure for recognizing structures plays the most important role in document understanding. This procedure extracts and classifies the logical item blocks from paper-based documents distinctly. We discuss the structure recognition methods for three classes of documents: 1) table-form documents, filled-in forms, cataloging lists, etc., in which each item block is surrounded by horizontal and vertical line segments; 2) library cataloging cards, name cards, letters, etc., in which each item block is separated by spaces; 3) newspapers, pamphlets, etc., in which each item block is constructed hierarchically and by combining under roughly specified layouts. The structure recognition procedure is characterized by individual recognition methods: in class 1 documents, binary trees indicating the connective relationships among neighboring item blocks, which are surrounded by line segments; in class 2 documents, binary trees defining the spatial and geometric relationships among neighboring item blocks, which are separated by spaces; and in class 3 documents, composition rules specifying the constructive relationships among neighboring item blocks, which are represented by adjacent relationship graphs. The methods are effective under the knowledge-based framework and are integrated complementarily from the top-down (model-driven) and bottom-up (data-driven) approaches. Of course, the integration means vary according to document classes.

19.