A Knowledge-Based Approach to Logical Structure Analysis and Understanding for Multi-Page Documents
Cite this article: Wang Shuhua, Li Zuo, Cai Shijie, Cao Yang. A Knowledge-Based Approach to Logical Structure Analysis and Understanding for Multi-Page Documents [J]. Computer Applications and Software (计算机应用与软件), 2002, 19(4): 33-37.
Authors: Wang Shuhua, Li Zuo, Cai Shijie, Cao Yang
Affiliations: 1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093
2. Department of Building and Real Estate, The Hong Kong Polytechnic University, Hong Kong
Abstract: The most important part of document image understanding is extracting the logical structure of the document. Current research concentrates on page layout analysis, and the few studies that address logical structure deal only with single-page documents or with multi-page documents whose page relationships are simple. Construction tender documents are distinctive in that their hierarchical logical structure is not marked by explicit index information but is implied by citing information. This paper presents a method that uses the citing relations between pages to obtain the logical structure of a document. The method represents the logical structure as a modified tree, and building the logical tree is itself the process of extracting the logical structure. This data structure also supports higher-level semantic processing and reconstruction of the output. The method has been implemented in VHTender, an automatic tender processing system, where it keeps the system flexible and efficient.

Keywords: Document understanding; Document processing; Layout analysis; Physical structure; Logical structure
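
The abstract's central idea, obtaining a logical tree by following the citing relations between pages, can be illustrated with a minimal sketch. This is not the authors' algorithm: the page names, the `citations` mapping, and the rule that a page cited from several places is attached only to its first (shallowest) citing page are all assumptions made for illustration, and the paper's "modified" tree may handle shared pages differently.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogicalNode:
    """One page in the logical tree (hypothetical structure)."""
    page: str
    children: list[LogicalNode] = field(default_factory=list)


def build_logical_tree(root_page: str, citations: dict[str, list[str]]) -> LogicalNode:
    """Build a tree by following page-to-page citations breadth-first.

    `citations` maps a page to the pages it cites, in reading order.
    Assumption: a page cited more than once keeps only its first parent,
    so the result is a plain tree rather than a graph.
    """
    root = LogicalNode(root_page)
    seen = {root_page}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for cited in citations.get(node.page, []):
            if cited in seen:
                continue  # already placed under an earlier citing page
            seen.add(cited)
            child = LogicalNode(cited)
            node.children.append(child)
            queue.append(child)
    return root


def outline(node: LogicalNode, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per page."""
    lines = ["  " * depth + node.page]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical tender: a summary page cites bills, which cite item pages.
    citations = {
        "summary": ["bill-A", "bill-B"],
        "bill-A": ["item-A1", "item-A2"],
        "bill-B": ["item-B1", "item-A2"],
    }
    print("\n".join(outline(build_logical_tree("summary", citations))))
```

Breadth-first traversal is used so that a page cited at several depths ends up under the citing page closest to the root. A structure that records every citing page would need a graph or cross-links instead.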
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.