Similar literature
 10 similar documents found (search time: 273 ms)
1.
Much of the world’s quantitative data resides in scattered web tables. To play a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we propose an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables into a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor the header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
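
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the idea that header paths uniquely index the data region. It assumes the header paths have already been extracted; the function name `to_category_table` and the sample data are hypothetical, not taken from the paper.

```python
def to_category_table(row_headers, col_headers, data):
    """Flatten a header-indexed table into relational records.

    row_headers[i] is the header path (a tuple) that indexes data row i;
    col_headers[j] is the header path that indexes data column j.
    Each output record is row path + column path + (value,).
    """
    # Unique indexing: no two rows (or columns) may share a header path.
    if len(set(row_headers)) != len(row_headers):
        raise ValueError("row header paths must be unique")
    if len(set(col_headers)) != len(col_headers):
        raise ValueError("column header paths must be unique")
    if len(data) != len(row_headers):
        raise ValueError("one row header path is needed per data row")

    records = []
    for i, row in enumerate(data):
        if len(row) != len(col_headers):
            raise ValueError("one column header path is needed per data column")
        for j, value in enumerate(row):
            records.append(row_headers[i] + col_headers[j] + (value,))
    return records


# Hypothetical table: rows indexed by country, columns by (year, sex).
row_headers = [("Norway",), ("Sweden",)]
col_headers = [("2010", "Male"), ("2010", "Female"),
               ("2011", "Male"), ("2011", "Female")]
data = [[1, 2, 3, 4],
        [5, 6, 7, 8]]

records = to_category_table(row_headers, col_headers, data)
```

Each record then has one field per header category plus the value, which is the shape a relational table or a set of triples could be loaded from.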

2.
We have established a preprocessing method for determining the meaningfulness of a table to allow for information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet for both knowledge structuring and document design. Therefore, we were interested in determining whether or not a table has meaningfulness that is related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguished meaningful tables from others, 3) constructed a training data set using the established features after having filtered any obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table head extraction from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.

3.
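Automatic extraction of drawing information using cells and feature points   Cited by: 2 (self-citations: 0, citations by others: 2)
The title block and parts list of an engineering drawing are important data sources for centralized product data management. To enable reuse of CAD data, this paper proposes an effective method for extracting part and component information from engineering drawings. By analyzing the forms that title blocks and parts lists take in engineering drawings, it summarizes the positional and shape features of these tables from both their macroscopic layout and their microscopic structure. It then proposes a strategy for automatic extraction of drawing data based on cells and feature points and describes the idea behind the algorithm and its implementation steps in detail. A utility program was developed and has been applied in engineering projects.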

4.
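Constructing a network representation of HowNet relations   Cited by: 9 (self-citations: 2, citations by others: 7)
This paper presents a network representation structure for HowNet relations and a method for implementing it. By constructing three data tables (a concept table, a feature table, and a relation table) and establishing bidirectional, multi-way links among their records, all of HowNet's knowledge (concepts, features, and the various relations among them) can be conveniently integrated, laying a solid foundation for further HowNet-based information retrieval and knowledge inference.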
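
The abstract names three tables and bidirectional links among their records but gives no schema. The sketch below is one possible in-memory reading of that structure; the class `KnowledgeNet`, its method names, and the sample entries are all hypothetical and are not HowNet's actual data format.

```python
from collections import defaultdict


class KnowledgeNet:
    """Concept, feature, and relation tables with links in both directions."""

    def __init__(self):
        self.concepts = {}    # concept id -> name
        self.features = {}    # feature id -> name
        self.relations = []   # rows of (relation type, source key, target key)
        self.outgoing = defaultdict(list)  # key -> indexes of relations leaving it
        self.incoming = defaultdict(list)  # key -> indexes of relations entering it

    def add_concept(self, cid, name):
        self.concepts[cid] = name

    def add_feature(self, fid, name):
        self.features[fid] = name

    def relate(self, rel_type, source, target):
        """Add one relation row and index it from both of its endpoints."""
        index = len(self.relations)
        self.relations.append((rel_type, source, target))
        self.outgoing[source].append(index)
        self.incoming[target].append(index)

    def neighbours(self, key):
        """Return (relation type, other end) pairs, following links both ways."""
        forward = [(self.relations[i][0], self.relations[i][2])
                   for i in self.outgoing.get(key, [])]
        backward = [(self.relations[i][0], self.relations[i][1])
                    for i in self.incoming.get(key, [])]
        return forward + backward


# Hypothetical entries; keys are (table name, record id) pairs.
net = KnowledgeNet()
net.add_concept("c1", "doctor")
net.add_concept("c2", "human")
net.add_feature("f1", "medical")
net.relate("hypernym", ("concept", "c1"), ("concept", "c2"))
net.relate("domain", ("concept", "c1"), ("feature", "f1"))

from_doctor = net.neighbours(("concept", "c1"))
from_human = net.neighbours(("concept", "c2"))
```

Because every relation row is indexed from both endpoints, a lookup can start from either a concept or a feature and reach the other side without scanning the relation table.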

5.
Tabular data often refers to data that is organized in a table with rows and columns. We observe that this data format is widely used on the Web and within enterprise data repositories. Tables potentially contain rich semantic information that still needs to be interpreted. The process of extracting meaningful information out of tabular data with respect to a semantic artefact, such as an ontology or a knowledge graph, is often referred to as Semantic Table Interpretation (STI) or Semantic Table Annotation. In this survey paper, we aim to provide a comprehensive and up-to-date state-of-the-art review of the different tasks and methods that have been proposed so far to perform STI. First, we propose a new categorization that reflects the heterogeneity of table types that one can encounter, revealing different challenges that need to be addressed. Next, we define five major sub-tasks that STI deals with, even though the literature has mostly focused on three sub-tasks so far. We review and group the many approaches that have been proposed into three macro families and we discuss their performance and limitations with respect to the various datasets and benchmarks proposed by the community. Finally, we detail the remaining scientific barriers to truly automatic interpretation of any type of table that can be found in the wild on the Web.

6.
In documents, tables are important structured objects that present statistical and relational information. In this paper, we present a robust system which is capable of detecting tables from free style online ink notes and extracting their structure so that they can be further edited in multiple ways. First, the primitive structure of tables, i.e., candidates for ruling lines and table bounding boxes, is detected among drawing strokes. Second, the logical structure of tables is determined by normalizing the table skeletons, identifying the skeleton structure, and extracting the cell contents. The detection process is similar to a decision tree so that invalid candidates can be ruled out quickly. Experimental results suggest that our system is robust and accurate in dealing with tables having complex structure or drawn under complex situations.

7.
We present a method for structuring a document according to the information present in its different organizational tables: table of contents, tables of figures, etc. This method is based on a two-step approach that leverages functional and formal (layout-based) kinds of knowledge. The functional definition of organizational table, based on five properties, is used to provide a first solution, which is improved in a second step by automatically learning the form of the table of contents. We also report on the robustness and performance of the method and we illustrate its use in a real conversion case.

8.
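Knowledge extraction from Web tables is an important way to obtain high-quality knowledge, with broad research significance and practical value in areas such as knowledge graphs and web mining. Traditional Web table knowledge extraction methods rely mainly on well-formed table structure and sufficient prior knowledge, and they perform poorly when the table structure is complex or prior knowledge is insufficient. To address these problems, this paper makes full use of the structural characteristics of the tables themselves and proposes a Web table knowledge extraction method, based on fast clustering with equivalence compression, that scales to large data sets. Unsupervised clustering groups tables with similar formal structure, from which their semantic structure is inferred in order to extract knowledge. Experimental results show that the fast clustering algorithm based on equivalence compression, while maintaining the same level of clustering accuracy, greatly improves time performance over traditional methods: the clustering time for 5,000 tables drops from 72 hours to 20 minutes, and the accuracy of the knowledge triples extracted with table templates after clustering is also satisfactory.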
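
The abstract does not define equivalence compression. One plausible reading, sketched below purely as an assumption, is that tables with an identical structural signature are merged into one equivalence class before any costly clustering, so that only one representative per class needs to be compared. The signature used here (per-cell type codes) and the function names are hypothetical.

```python
from collections import defaultdict


def cell_type(cell):
    """Classify a cell as empty (E), numeric (N), or text (T)."""
    text = str(cell).strip()
    if not text:
        return "E"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "N"
    except ValueError:
        return "T"


def signature(table):
    """Structural signature: one string of type codes per row."""
    return tuple("".join(cell_type(c) for c in row) for row in table)


def compress(tables):
    """Group table indexes into equivalence classes by identical signature."""
    classes = defaultdict(list)
    for index, table in enumerate(tables):
        classes[signature(table)].append(index)
    return dict(classes)


# Hypothetical tables: the first two share a structure, the third does not.
tables = [
    [["City", "Pop"], ["Oslo", "700,000"]],
    [["Name", "Count"], ["Lund", "94000"]],
    [["Year", "2020", "2021"]],
]
classes = compress(tables)
```

A clustering step would then run over the keys of `classes` rather than over every table, which is where a reduction in running time could come from under this reading.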

9.
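An autonomous method for extracting Web information   Cited by: 12 (self-citations: 0, citations by others: 12)
Xu Jianchao (许建潮), Hou Kun (侯锟). Computer Engineering and Applications (《计算机工程与应用》), 2005, 41(14): 185-189, 198
This paper proposes a method for autonomous extraction of information from Web pages based on table structures and list structures. According to the user's information needs, it autonomously extracts information from relevant pages, reorganizes the extracted information according to the relational model, and stores it in a database. For information sources with a table structure, only one page needs to be annotated to obtain the extraction knowledge; through self-learning, the method adapts well to dynamic changes in page information and achieves automatic extraction. For information sources with a list structure, the DOM tree structure is analyzed to dynamically obtain the path of each information block in the DOM hierarchy, and the values of information objects are obtained from the basic extraction knowledge about those objects. A self-learning approach is used to adapt to dynamic changes in page information.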
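
As a rough illustration of locating an information block's path in the DOM hierarchy, the sketch below uses Python's standard `html.parser` to record the tag path of every text node containing a target string. The class `PathFinder` and the sample page are hypothetical, and the paper's actual path-learning procedure is not described in the abstract.

```python
from html.parser import HTMLParser


class PathFinder(HTMLParser):
    """Record the tag path of each text node that contains a target string."""

    # Elements that never have a closing tag, so they are not pushed.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # paths found so far

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.target in data:
            self.paths.append("/" + "/".join(self.stack))


# Hypothetical list-structured page.
page = ("<html><body><ul>"
        "<li>Price: 10</li><li>Name: pen</li>"
        "</ul></body></html>")
finder = PathFinder("Price")
finder.feed(page)
price_paths = finder.paths
```

Once a path is known for one labelled item, the same path can be tried on other pages from the same source, and re-learned if the page layout changes.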

10.
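The purpose of establishing a temporary relation is to make the record pointer of the child table move as the record pointer of the parent table moves, so that data in multiple tables can be browsed at the same time. This paper first briefly introduces basic concepts such as table relations, the data session, and temporary relations, and then uses examples to show the steps and methods for establishing relations between tables and running queries with the data session.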
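
The behavior described, a child record pointer that follows the parent's, can be modeled in a few lines. This is a plain-Python sketch with hypothetical class, method, and field names; it is not the API of any particular database product.

```python
class Table:
    """A list of records with a record pointer and optional child relations."""

    def __init__(self, records):
        self.records = records
        self.pointer = 0
        self.children = []  # (child table, parent field, child field)

    def set_relation(self, child, parent_field, child_field):
        """Link a child table so that its pointer follows this table's pointer."""
        self.children.append((child, parent_field, child_field))
        self._sync()

    def go(self, position):
        """Move the record pointer, then reposition every related child."""
        self.pointer = position
        self._sync()

    def current(self):
        """Return the current record, or None if the pointer is off the table."""
        if 0 <= self.pointer < len(self.records):
            return self.records[self.pointer]
        return None

    def _sync(self):
        parent = self.current()
        for child, parent_field, child_field in self.children:
            match = -1  # -1 means no matching child record
            if parent is not None:
                for i, record in enumerate(child.records):
                    if record[child_field] == parent[parent_field]:
                        match = i
                        break
            child.go(match)


# Hypothetical parent and child tables.
students = Table([{"id": 1, "name": "Li"}, {"id": 2, "name": "Wang"}])
scores = Table([{"student_id": 2, "score": 90}, {"student_id": 1, "score": 85}])
students.set_relation(scores, "id", "student_id")

first_score = scores.current()["score"]
students.go(1)  # moving the parent pointer moves the child pointer too
second_score = scores.current()["score"]
```

The relation lives only in the `children` list of the parent object, which mirrors the temporary nature of the relation: nothing in the stored records changes.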
