首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
As sharing documents through the World Wide Web has been recently and constantly increasing, the need for creating hyperdocuments to make them accessible and retrievable via the internet, in formats such as HTML and SGML/XML, has also been rapidly rising. Nevertheless, only a few works have been done on the conversion of paper documents into hyperdocuments. Moreover, most of these studies have concentrated on the direct conversion of single-column document images that include only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on the logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that, by using the proposed methods, their corresponding HTML documents can be generated in the same visual layout as that of the document images, and their structured table of contents page can be also produced with the hierarchically ordered section titles hyperlinked to the contents.  相似文献   

2.
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents.  相似文献   

3.
Successful applications of digital libraries require structured access to sources of information. This paper presents an approach to extract the logical structure of text documents. The extracted structure is explicated by means of SGML (Standard Generalized Markup Language). Consequently, the extraction is achieved on the basis of grammars that extend SGML with recognition rules. From these grammars parsing automata are generated. These automata are used to partition a flat text document into its elements, to discard formatting information, and to insert SGML markups. Complex document structures and fallback rules needed for error tolerant parsing make such automata highly ambiguous. A novel parsing strategy has been developed that ranks and prunes ambiguous parsing paths.  相似文献   

4.
Extensible Markup Language (XML) is a simple, flexible text format derived from SGML, which is originally designed to support large-scale electronic publishing. Nowadays XML plays a fundamental role in the exchange of a wide variety of data on the Web. As XML allows designers to create their own customized tags, enables the definition, transmission, validation, and interpretation of data between applications, devices and organizations, lots of works in soft computing employ XML to take control and responsibility for the information, such as fuzzy markup language, and accordingly there are lots of XML-based data or documents. However, most of mobile and interactive ubiquitous multimedia devices have restricted hardware such as CPU, memory, and display screen. So, it is essential to compress an XML document/element collection to a brief summary before it is delivered to the user according to his/her information need. Query-oriented XML text summarization aims to provide users a brief and readable substitution of the original retrieved documents/elements according to the user’s query, which can relieve users’ reading burden effectively. We propose a query-oriented XML summarization system QXMLSum, which extracts sentences and combines them as a summary based on three kinds of features: user’s queries, the content of XML documents/elements, and the structure of XML documents/elements. Experiments on the IEEE-CS datasets used in Initiative for the Evaluation of XML Retrieval show that the query-oriented XML summary generated by QXMLSum is competitive.  相似文献   

5.
利用关系表构建XML文档解析的树模型   总被引:2,自引:1,他引:1  
祝青  阳王东 《计算机应用》2009,29(6):1719-1721
在对XML文档的数据解析和查询操作研究中,发现树能较好地反映XML文档的层次结构,但其查询效率较低,而关系表是一种适合存储大量数据且有较好查询效率与操作功能的数据结构。给出了一个把树和关系表相结合构建一种存储XML文档的数据模型;在这个模型的解析过程中,采用回调事件式的分段解析方法以减少解析时间和存储空间。这样既能较好保存XML文档的结构特点,又能提高其查询的效率和操作的便利性。通过对大数据量XML文档的解析和操作实验,实验结果证明这种数据模型在处理大型XML文档中具有明显优势。  相似文献   

6.
对于基于DTD在关系数据库中存储XML文档,此处利用结点模型映射方法,实现用关系模式来表示目标XML文档的逻辑结构(即 XML模式或DTD).还介绍了如何在已建立好的关系模式中添加约束用来保持原有XML文档中隐含的约束信息,此外XML文档的元素之间通常是相互递归的,这里也对XML文档中在出现递归的情况时,如何来存储递归的XML文档进行说明.最后通过举例,证明此种方法是合理有效的.  相似文献   

7.
本文对细粒度XML文档上的BIBA严格完整性策略进行了研究。通过对XML文档的结构约束进行分析,建立了XML文档上的完整性约束规则,从而将BIBA严格完整性策略扩展到了XML文档上;建立一个包含完整性属性的XML文档模型,对XML文档的结构特点进行了分析;提出了完整性标签传播规则,以支持部分标记完整性的XML文档;最后对完整性策略的实现机制进行了讨论。  相似文献   

8.
The next generation of interactive multimedia documents can contain both static media, e.g., text, graph, image, and continuous media, e.g., audio and video, and can provide user interactions in distributed environments. However, the temporal information of multimedia documents cannot be described using traditional document structures, e.g., Open Document Architecture (ODA) and Standard Generalized Mark-up Language (SGML); the continuous transmission of media units also raises some new synchronization problems, which have not been met before, for processing user interactions. Thus, developing a distributed interactive multimedia document system should resolve the issues of document model, presentation control architecture, and control scheme. In this paper, we (i) propose a new multimedia document model that contains the logical structure, the layout structure, and the temporal structure to formally describe multimedia documents, and (ii) point out main interaction-based synchronization problems, and propose a control architecture and a token-based control scheme to solve these interaction-based synchronization problems. Based on the proposed document model, control architecture, and control scheme, a distributed interactive multimedia document development mechanism, which is called MING-I, is developed on SUN workstations.  相似文献   

9.
基于PCA的XML文档特征提取方法   总被引:1,自引:0,他引:1  
郭丽红  王箭 《计算机工程与设计》2011,32(11):3894-3896,3911
为了更好地对XML文档进行分类或聚类分析,以主成分分析的理论基础为指导,在研究了文本表示的各种模型的基础上,提出了两种对XML文档进行向量化表示并进行特征提取的方法,同时也实现了对XML文档的有效降维。实验结果表明,两种方法都能有效地表示XML文档的主体特征,但全路径特征向量抽取方法能更好地描述XML信息,为下一步有效处理XML文档做了良好铺垫,具有一定的研究价值。  相似文献   

10.
Most documents have a hierarchical structure, which can be made explicit by markup languages such as SGML. In this paper we propose a formal model for representation of hierarchically structured documents, to be used as the basis for document query languages. The model uses a redundant representation of the document elements to simplify the expression of common queries. As an illustration of the power of the model we show how queries might be expressed, both as set-theoretic expressions and in a simple algebra, and outline how queries might be evaluated in a practical system.  相似文献   

11.
文档图像理解中最重要的部分是逻辑结构的提取。目前的研究主要集中在页面的布局分析上,少数对文档逻辑结构的研究只是针对单页文档或页面关系简单的多页文档。建筑标书的特殊性在于其层次式的逻辑组成结构没有明确的索引信息标识。本文提出了一种利用页面间引用关系获取文档逻辑结构的方法。该方法采用修正的树形结构表示文档的逻辑结构,逻辑树的创建过程就是逻辑结构的获取过程,而且有利于更高层的语义处理及还原输出。该方法已在标书自动处理系统中实现,保证了该系统的灵活和高效。  相似文献   

12.
While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format to process and manage information. In particular, it can embed the information of logical structures. However, in order to utilize XML, the logical structures of HTML tables should first be extracted and transformed into XML representations. This paper presents an efficient method for the process, which consists of two phases: area segmentation and structure analysis. The area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1180 HTML tables show that the proposed method performs better than conventional methods, resulting in an average accuracy of 86.7%.  相似文献   

13.
XML文档架构与关系数据模型间的映射研究   总被引:6,自引:2,他引:6  
XML逐渐成为Internet上数据描述和交换的标准。随着Web上大量数据用XML文档表示出来,有必要对这些XML文档进行操纵管理。为了结合关系数据库系统强大的数据操纵能力,论文在对XML文档的逻辑结构进行简要介绍的基础上,就XML文档特别是结构化XML文档与关系数据模型数据之间的互动映射作了深入探讨,特别是在数据结构和数据完整性约束条件的映射关系上作了更深一层的研究,提出了一系列基于XML本身的映射规则。  相似文献   

14.
Transforming paper documents into XML format with WISDOM++   总被引:1,自引:1,他引:0  
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000  相似文献   

15.
文本分类是自然语言处理领域的核心任务之一,深度学习的发展给文本分类带来更广阔的发展前景。针对当前基于深度学习的文本分类方法在长文本分类中的优势和不足,该文提出一种文本分类模型,在层次模型基础上引入混合注意力机制来关注文本中的重要部分。首先,按照文档的层次结构分别对句子和文档进行编码;其次,在每个层级分别使用注意力机制。句编码时在全局目标向量基础上同时利用最大池化提取句子特定的目标向量,使编码出的文档向量具有更加明显的类别特征,能够更好地关注到每个文本最具区别性的语义特征。最后,根据构建的文档表示对文档分类。在公开数据集和行业数据集上的实验结果表明,该模型对具有层次结构的长文本具有更优的分类性能。  相似文献   

16.
XML文档相似性的仿真研究   总被引:1,自引:0,他引:1  
XML文档相似性的计算是XML文档分类中的一个难题。文中描述了一种基于结构的方法,通过序列化模式挖掘方法,挖掘出两个文档之间的最大相似路径,从而可以通过计算最大相似的路径的节点数目和所有路径的节点数目的比值,得到两个文档之间的相似度。文章提出了一种新的最小化XML文档的方法,并且综合考虑了文档节点的语义相似度和结构相似度,从而进一步地提高了计算文档相似度的精度。实验表明,该方法有着良好的应用前景。  相似文献   

17.
In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probabilistic distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstration on model documents and on Internet XML documents are presented.  相似文献   

18.
Document representation and its application to page decomposition   总被引:6,自引:0,他引:6  
Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on the connected component extraction to efficiently implement page segmentation and region identification. A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has a high accuracy and takes approximately 1.4 seconds on a SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler) for a 2550×3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise  相似文献   

19.
基于XML和N层VSM的Web信息检索   总被引:2,自引:0,他引:2  
基于XML文档格式良好、层次清晰,可以方便地操纵、分析其结构的特点。文中在将Web上的HTML文档转化为XML文档的基础上,通过Java中的DOM树,分析文档的层次结构。把文档分为层次化的文本段,对传统的VSM算法进行改进,把每个文本段转换为空间向量,实现了N层VSM算法,通过试验证明,改进后算法的查全率和查准率都要优于传统的VSM算法。  相似文献   

20.
In name and in practice, the World‐Wide Web (hereafter Web) is used around the World beyond English‐speaking areas. This creates a tremendous need to internationalize standard terminology used in the technologies that make the Web possible. Existing efforts on XML internationalization (i18n) and localization (i10n) have focused on the content of XML documents instead of the terms used in markup (annotations) such as elements and attributes. The SGML standard ISO 8879 supports the use of Unicode (ISO 10646) throughout a document, including markups. However, most elements and attributes of XML documents are still defined in English, thereby limiting their use among non‐English speakers. This paper presents an XSLT‐based method that can completely localize the markup of XML documents into different natural languages. We also describe how the proposed technique can be applied to translation problems in programming (e.g. C and Java) or documentation (e.g. LATEX or other formatting languages) so that a program or a document can be converted to and from an XML format. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号