Similar Documents
20 similar documents found (search time: 31 ms)
1.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the web.
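A minimal Python sketch of the line-splitting idea, not the authors' full system: the candidate delimiters and the "most consistent field count wins" rule below are simplifying assumptions standing in for the paper's corpus-driven alignment step.

```python
import re
from collections import Counter

def split_lines(lines, delimiters=(r"\s{2,}", r"\t", r",", r";", r"\|")):
    """Try each candidate delimiter and keep the one whose field count
    agrees across the most lines (a crude stand-in for corpus-based alignment)."""
    best, best_score = None, -1.0
    for delim in delimiters:
        rows = [re.split(delim, line.strip()) for line in lines]
        n_fields, support = Counter(len(r) for r in rows).most_common(1)[0]
        if n_fields < 2:
            continue
        score = support / len(rows)      # fraction of lines that agree on the split
        if score > best_score:
            best, best_score = [r for r in rows if len(r) == n_fields], score
    return best, best_score              # candidate table rows + a confidence score

lines = ["Alice  Smith  1984", "Bob  Jones  1979", "Carol  Lee  1991"]
table, confidence = split_lines(lines)
print(table, confidence)
```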

2.
We have established a preprocessing method for determining whether a table is meaningful, to allow information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet both for knowledge structuring and for document design. We were therefore interested in determining whether or not a table is meaningful, in the sense of being related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from others, 3) constructed a training data set using the established features after having filtered out any obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table head extraction from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
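A minimal sketch of the classification step described above, assuming hand-crafted per-table features; the feature set and toy training data are illustrative, not the authors'.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative features per table: [rows, cols, ratio of numeric cells,
# ratio of cells containing links, has_th_tag (0/1)]
X = [
    [12, 4, 0.60, 0.05, 1],   # looks like a data table
    [30, 3, 0.45, 0.10, 1],   # looks like a data table
    [ 2, 1, 0.00, 0.90, 0],   # layout / decorative table
    [ 5, 2, 0.05, 0.80, 0],   # layout / decorative table
]
y = [1, 1, 0, 0]              # 1 = meaningful, 0 = decorative

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[8, 5, 0.7, 0.0, 1]]))   # -> predicted "meaningful"
```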

3.
Hierarchical Wrapper Induction for Semistructured Information Sources   (Cited by: 16; self-citations: 0; citations by others: 16)
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
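A minimal sketch of a landmark-based extraction rule in the spirit of STALKER's SkipTo rules; the page snippet and the particular rule are invented for illustration, not taken from the paper.

```python
def skip_to(text, landmarks, start=0):
    """Advance past each landmark in order; return the index just after
    the last one, or None if any landmark is missing."""
    pos = start
    for lm in landmarks:
        i = text.find(lm, pos)
        if i < 0:
            return None
        pos = i + len(lm)
    return pos

page = "<b>Name:</b> Joe's Pizza <b>Phone:</b> (800) 555-1234 <br>"
start = skip_to(page, ["Phone:", "</b>"])   # start rule: SkipTo(Phone:), SkipTo(</b>)
end = page.find("<br>", start)              # end rule: SkipTo(<br>)
print(page[start:end].strip())              # -> (800) 555-1234
```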

4.
This paper presents the design of an end-to-end method for extracting information from tables embedded in documents; the input format is ASCII, to which any richer format can be converted while preserving all textual and much of the layout information. We start by defining what a table is. Then we describe the steps involved in extracting information from tables and analyse table-related research to place the contributions of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, with greater interaction between the different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how the actual interpretation of the table not only allows inferring the accuracy of the overall extraction process but also contributes to improving its quality. To do so, we believe interpretation has to consider context-specific knowledge; we explore how this knowledge can be added in a plug-in/out manner, so that the overall method remains operable in different contexts. The opinions expressed in this article are the responsibility of the authors and do not necessarily reflect those of Banco de Portugal.

5.
Unsupervised named-entity extraction from the Web: An experimental study   (Cited by: 6; self-citations: 0; citations by others: 6)
The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., “chemist” and “biologist” are identified as sub-classes of “scientist”). List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
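A minimal sketch of generic-pattern (Hearst-style) instance extraction in the spirit of KnowItAll; the patterns, sentences, and class name are made up for illustration and are much simpler than the system's extractors.

```python
import re

def extract_instances(class_name, sentences):
    """Pull capitalized phrases that follow generic patterns such as
    '<class>s such as X' or 'X and other <class>s'."""
    patterns = [
        rf"{class_name}s such as ([A-Z][\w.-]+(?: [A-Z][\w.-]+)*)",
        rf"([A-Z][\w.-]+(?: [A-Z][\w.-]+)*) and other {class_name}s",
    ]
    hits = set()
    for s in sentences:
        for p in patterns:
            hits.update(m.group(1) for m in re.finditer(p, s))
    return hits

sentences = [
    "We interviewed scientists such as Marie Curie about their work.",
    "Alan Turing and other scientists attended the meeting.",
]
print(extract_instances("scientist", sentences))   # {'Marie Curie', 'Alan Turing'}
```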

6.
The success of bilinear subspace learning heavily depends on reducing correlations among features along rows and columns of the data matrices. In this work, we study the problem of rearranging elements within a matrix in order to maximize these correlations so that information redundancy in matrix data can be more extensively removed by existing bilinear subspace learning algorithms. An efficient iterative algorithm is proposed to tackle this essentially integer programming problem. In each step, the matrix structure is refined with a constrained Earth Mover's Distance procedure that incrementally rearranges matrices to become more similar to their low-rank approximations, which have high correlation among features along rows and columns. In addition, we present two extensions of the algorithm for conducting supervised bilinear subspace learning. Experiments in both unsupervised and supervised bilinear subspace learning demonstrate the effectiveness of our proposed algorithms in improving data compression performance and classification accuracy.

7.
Record linkage is a process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset have been applied to record linkage. These techniques reduce the requirement on the manual labelling of the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. In order to further improve the automatic self-learning process we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on 4 publicly available datasets which are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. In 3 out of 4 datasets it also achieves comparable results to those of the supervised approaches.
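A minimal sketch of the weighted automatic seed-selection idea, assuming each candidate record pair is already represented by a vector of field similarities; the weights and quantile thresholds are placeholders, not the paper's scheme.

```python
import numpy as np

def select_seeds(sim_vectors, field_weights, match_q=0.95, nonmatch_q=0.05):
    """Score each candidate pair by a weighted sum of field similarities and
    take the clearly-high pairs as 'match' seeds and the clearly-low pairs
    as 'non-match' seeds to start self-learning without manual labels."""
    sims = np.asarray(sim_vectors, dtype=float)
    scores = sims @ np.asarray(field_weights, dtype=float)
    hi, lo = np.quantile(scores, match_q), np.quantile(scores, nonmatch_q)
    return sims[scores >= hi], sims[scores <= lo]

rng = np.random.default_rng(0)
pairs = rng.random((1000, 4))            # 1000 candidate pairs, 4 compared fields
match_seeds, nonmatch_seeds = select_seeds(pairs, field_weights=[0.4, 0.3, 0.2, 0.1])
print(len(match_seeds), len(nonmatch_seeds))
```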

8.
An Automatic Data Extraction Method for Template-Generated Web Pages   (Cited by: 5; self-citations: 0; citations by others: 5)
Many web pages today are generated dynamically: in response to a request, the site selects data from a back-end database and embeds it into a common template, as in the product description pages of e-commerce sites. This work studies how to detect the underlying template behind such template-generated pages and automatically extract the embedded data (e.g., product names and prices). We give a formal description of the template detection problem and analyze in depth the structural characteristics of template-generated pages. We then propose a novel template detection method and use the detected template to automatically extract data from instance pages. Compared with existing methods, our approach handles both "list pages" and "detail pages". Experiments on two third-party test sets show that the method achieves high extraction accuracy.
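A minimal sketch of the template-vs-data intuition, as a token-level simplification rather than the paper's structural algorithm: tokens identical across all instance pages are treated as template, the varying positions as embedded data.

```python
def detect_template(pages):
    """Align pages token-by-token (assumes equal token counts for simplicity);
    positions whose token is identical across all pages form the template."""
    token_lists = [p.split() for p in pages]
    template, slots = [], []
    for i, tokens in enumerate(zip(*token_lists)):
        if len(set(tokens)) == 1:
            template.append(tokens[0])
        else:
            template.append("<DATA>")
            slots.append(i)
    return template, slots

pages = [
    "Product: iPhone Price: 699 USD",
    "Product: ThinkPad Price: 1199 USD",
]
template, slots = detect_template(pages)
print(" ".join(template))                     # Product: <DATA> Price: <DATA> USD
print([pages[0].split()[i] for i in slots])   # data extracted from the first page
```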

9.
Convolutional neural networks (CNNs) have strong feature extraction capability and can effectively improve the classification accuracy of hyperspectral images. However, training a CNN requires a large number of training samples to prevent overfitting. Gabor filters extract spatial information such as edges and textures in an unsupervised manner, which can reduce a CNN's dependence on training samples and ease the burden of feature extraction. To fully exploit the advantages of both CNNs and Gabor filters, this paper proposes Gabor-DC-CNN, a hyperspectral image classification method that combines a dual-channel CNN with 3D Gabor filters. A 2D convolutional neural network (2D-CNN) processes the raw hyperspectral image data to extract deep spatial features, while a 1D convolutional neural network (1D-CNN) processes the 3D Gabor feature data to extract deep spectral-texture features. The fully connected layers of the two CNNs are concatenated to fuse the features, and the fused features are fed into a classification layer. Experimental results show that the method effectively improves classification accuracy, reaching 98.95%, 99.56%, and 99.67% on the Indian Pines, Pavia University, and Kennedy Space Center data sets, respectively.
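A minimal PyTorch sketch of the dual-channel fusion idea; the layer sizes, patch size, and number of Gabor features are placeholders, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class DualChannelCNN(nn.Module):
    def __init__(self, n_bands=103, n_gabor=64, n_classes=9):
        super().__init__()
        # 2D branch: spatial features from raw hyperspectral patches (bands as channels)
        self.spatial = nn.Sequential(
            nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # 1D branch: spectral-texture features from the 3D-Gabor response vector
        self.spectral = nn.Sequential(
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        # fusion: concatenate both representations, then classify
        self.classifier = nn.Linear(32 + 16, n_classes)

    def forward(self, patch_2d, gabor_1d):
        fused = torch.cat([self.spatial(patch_2d), self.spectral(gabor_1d)], dim=1)
        return self.classifier(fused)

model = DualChannelCNN()
logits = model(torch.randn(4, 103, 9, 9), torch.randn(4, 1, 64))
print(logits.shape)   # torch.Size([4, 9])
```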

10.
Web table knowledge extraction is an important way to obtain high-quality knowledge and has broad research significance and application value for knowledge graphs, web mining, and related areas. Traditional web table knowledge extraction methods rely mainly on well-formed table structures and sufficient prior knowledge, and they struggle when table structures are complex or prior knowledge is lacking. To address these problems, this paper exploits the structural characteristics of tables themselves and proposes a web table knowledge extraction method for large-scale data based on fast clustering with equivalence compression: tables with similar structural forms are grouped by unsupervised clustering, and their semantic structure is then inferred in order to extract knowledge. Experimental results show that, while maintaining the same level of clustering accuracy, the equivalence-compression-based fast clustering algorithm greatly improves time performance over traditional methods, reducing the clustering time for 5,000 tables from 72 hours to 20 minutes; the knowledge triples extracted with the resulting table templates also reach satisfactory accuracy.
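A minimal sketch of grouping tables by a compressed structural signature, as a crude stand-in for the paper's equivalence-compression clustering; the signature definition (column count plus per-column numeric/text tags) is an assumption.

```python
from collections import defaultdict

def structure_signature(table):
    """Compress a table (list of rows) into a structural signature: the column
    count plus a tag per column saying whether it looks numeric or textual."""
    n_cols = max(len(r) for r in table)
    tags = []
    for c in range(n_cols):
        cells = [r[c] for r in table if c < len(r)]
        numeric = sum(cell.replace(".", "", 1).isdigit() for cell in cells)
        tags.append("N" if numeric > len(cells) / 2 else "T")
    return (n_cols, tuple(tags))

def cluster_tables(tables):
    clusters = defaultdict(list)
    for t in tables:
        clusters[structure_signature(t)].append(t)   # equal signature -> same cluster
    return clusters

t1 = [["City", "Population"], ["Oslo", "709000"], ["Bergen", "286000"]]
t2 = [["Country", "GDP"], ["Norway", "482"], ["Sweden", "585"]]
print(len(cluster_tables([t1, t2])))   # 1 cluster: both share the (2, ('T', 'N')) form
```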

11.
Singular value decomposition (SVD) is not always accurate for analyzing preference features and sometimes yields uninterpretable results. This paper proposes using CUR matrix decomposition (joint row and column selection) to obtain a low-rank approximation of the original matrix M (user preferences for products) and to extract the latent preferences of users and products. First, statistical influence (leverage) scores are computed for the rows and columns of M, and the columns and rows with the highest scores are extracted to form the low-dimensional matrices C and R; the matrix U is then constructed approximately from M, C, and R. This turns preference feature extraction in a high-dimensional space into matrix analysis in a low-dimensional space, with better interpretability and accuracy. Finally, theoretical analysis and experiments show that, compared with traditional decomposition methods, CUR matrix decomposition achieves higher accuracy, better interpretability, and a higher compression ratio for preference feature extraction.
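A minimal numpy sketch of CUR with SVD-based leverage scores, simplified to deterministic top-k selection rather than the paper's exact selection scheme.

```python
import numpy as np

def cur_decompose(M, k=2, c=3, r=3):
    """Pick the c columns and r rows of M with the largest statistical leverage
    scores (from the top-k singular vectors), then set U = C^+ M R^+ so that
    C @ U @ R approximates M."""
    Usvd, _, Vt = np.linalg.svd(M, full_matrices=False)
    col_scores = (Vt[:k] ** 2).sum(axis=0)       # leverage of each column
    row_scores = (Usvd[:, :k] ** 2).sum(axis=1)  # leverage of each row
    cols = np.argsort(col_scores)[-c:]
    rows = np.argsort(row_scores)[-r:]
    C, R = M[:, cols], M[rows, :]
    U = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)
    return C, U, R, cols, rows

rng = np.random.default_rng(1)
M = rng.random((20, 3)) @ rng.random((3, 15))    # a low-rank "preference" matrix
C, U, R, cols, rows = cur_decompose(M, k=3, c=4, r=4)
print(np.linalg.norm(M - C @ U @ R) / np.linalg.norm(M))  # small relative error
```

Because C and R are actual columns and rows of M, the factors keep the interpretability the abstract emphasizes.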

12.
Existing deep-learning-based bearing fault diagnosis methods depend on the data and require the training and test data to follow the same distribution. Under varying operating conditions, the fault diagnosis accuracy of a network model degrades as the data distribution shifts. To ensure that the model can accurately identify the health state of bearings under varying operating conditions, this paper proposes a novel intelligent fault diagnosis model based on unsupervised domain adaptation: a dynamic-convolution multi-layer domain adaptation network. On one hand, the network exploits the strong feature extraction capability of dynamic convolution to extract more effective fault features; on the other hand, it applies correlation alignment as a nonlinear transformation and simultaneously aligns the second-order statistics of the fault feature distributions at multiple layers, promoting the transfer of diagnostic knowledge from the source domain to the target domain and improving fault recognition accuracy when the target domain has no fault labels. Experiments on 14 transfer tasks across two data sets show that the dynamic-convolution multi-layer domain adaptation network achieves high fault diagnosis accuracy.
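A minimal numpy sketch of the correlation-alignment (CORAL) loss used to match second-order statistics of source and target features at one layer; the paper applies this at multiple layers, and the batch sizes and feature dimensions below are placeholders.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """Squared Frobenius distance between the feature covariance matrices of
    the source and target batches, with the usual 1/(4 d^2) normalization."""
    def covariance(X):
        Xc = X - X.mean(axis=0, keepdims=True)
        return Xc.T @ Xc / (X.shape[0] - 1)
    d = source_feats.shape[1]
    diff = covariance(source_feats) - covariance(target_feats)
    return np.sum(diff ** 2) / (4 * d * d)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(64, 128))   # features from labelled source-domain data
tgt = rng.normal(0.5, 1.5, size=(64, 128))   # features from unlabelled target-domain data
print(coral_loss(src, tgt))   # grows as the two feature distributions diverge
```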

13.
Excel-template-based report generation fills in report data by customizing an Excel template format and configuring a template information table. It is especially convenient for automatically generating the irregular, "Chinese-style" reports that are common in China, and the solution has been applied with good practical results in an exploration production and operation management system.
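A minimal sketch of template-driven filling using openpyxl; the template file name, the field-to-cell mapping, and the data are invented for illustration and stand in for the system's "template information table".

```python
from openpyxl import load_workbook

# Hypothetical "template information table": maps report fields to template cells.
CELL_MAP = {"report_month": "C2", "output_tons": "D5", "revenue": "D6"}
data = {"report_month": "2024-05", "output_tons": 1280, "revenue": 3_450_000}

wb = load_workbook("monthly_report_template.xlsx")   # pre-formatted irregular layout
ws = wb.active
for field, cell in CELL_MAP.items():
    ws[cell] = data[field]                           # fill values, keep template styling
wb.save("monthly_report_2024-05.xlsx")
```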

14.
Sullivan, J. Computer, 1997, 30(6)
Does your relational database speak SQL fluently? It's easy to find out, because the SQL (Structured Query Language) Test Suite is now free on the Web. SQL is the standard that lets DBMS products from different vendors interoperate. It defines common data structures (tables, columns, views, and so on) and provides a data manipulation language to populate, update, and query those structures. Accessing structured data with SQL is quite different from searching the full text of documents on the Web. Structured data in the relational model means data that can be represented in tables. Each row represents a different item, and the columns represent various attributes of the item. Columns have names and integrity constraints that specify valid values. Because the column values are named and represented in a consistent format, you can select rows precisely, on the basis of their contents. This is especially helpful in dealing with numeric data. You can also join data from different tables on the basis of matching column values. It is possible to do useful types of analysis too, listing items that are in one table and are missing, present, or have specific attributes in a related table. You can extract from a large table precisely those rows of interest, regroup them, and generate simple statistics on them.
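A minimal sketch of the kind of structured access described above, using SQLite from Python; the table names and data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items(id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders(item_id INTEGER, qty INTEGER);
    INSERT INTO items VALUES (1, 'widget', 2.5), (2, 'gadget', 9.0), (3, 'gizmo', 4.0);
    INSERT INTO orders VALUES (1, 10), (1, 5), (3, 2);
""")

# Select rows precisely by content, join tables on matching column values, and
# compute simple statistics; items never ordered still appear via the LEFT JOIN.
for row in con.execute("""
    SELECT i.name, COALESCE(SUM(o.qty), 0) AS total_qty
    FROM items i LEFT JOIN orders o ON o.item_id = i.id
    GROUP BY i.id ORDER BY total_qty DESC
"""):
    print(row)
```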

15.
16.
Han Jie, Guo Qing, Li An. Journal of Image and Graphics, 2017, 22(12): 1788-1797
Objective: Road extraction from high-resolution remote sensing images of complex scenes mostly relies on supervised classification, which requires manually selected samples, offers little automation, and is unstable. Pixel-level methods yield low extraction completeness and tend to produce salt-and-pepper noise, while object-oriented methods tend to produce adhesion artifacts. To improve the completeness, accuracy, and automation of road extraction, a road extraction method based on unsupervised classification and geometric-texture-spectral features is proposed. Method: First, unsupervised classification based on spectral features performs an initial segmentation, which is combined with a texture-feature-based classification to obtain the initial road regions. Then, a complete filtering scheme removes non-road regions according to road characteristics: edge filtering breaks the connections between road and non-road regions, texture filtering removes large non-road regions, and shape filtering removes the remaining small non-road regions. Finally, a tensor voting algorithm produces coherent, smooth road centerlines. Results: Experiments on high-resolution IKONOS and QuickBird images of complex scenes compare the method with two representative pixel-based and object-oriented road extraction methods, using completeness, correctness, and detection quality as quantitative evaluation measures. The results show average improvements of 26.61%, 5.57%, and 26.77% in completeness, correctness, and detection quality over the other algorithms. Qualitative analysis shows that the method effectively alleviates salt-and-pepper noise and adhesion, and it is also more automated. Conclusion: The proposed road extraction method for high-resolution remote sensing images, based on unsupervised classification and geometric-texture-spectral features, is more automated than supervised approaches; fusing geometric, texture, and spectral features avoids the salt-and-pepper noise typical of pixel-level extraction and the adhesion typical of object-oriented extraction. The method is suitable for urban road extraction from high-resolution remote sensing images and achieves high completeness, accuracy, and automation. Road extraction combining unsupervised classification with multiple features has broad application prospects.
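A minimal, heavily simplified sketch of two of the steps above (unsupervised spectral classification followed by shape filtering of small regions); the toy image, the "brightest cluster = road" heuristic, and the area threshold are assumptions and do not reproduce the paper's full pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage

rng = np.random.default_rng(0)
img = rng.random((200, 200, 4))                  # toy 4-band high-resolution image

# Unsupervised spectral classification: cluster pixels, then keep the cluster
# assumed to correspond to roads (here, crudely, the brightest one).
pixels = img.reshape(-1, img.shape[-1])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pixels)
brightness = [pixels[labels == c].mean() for c in range(5)]
road_mask = (labels == int(np.argmax(brightness))).reshape(img.shape[:2])

# Shape filtering: drop small connected components unlikely to be road segments.
comp, n = ndimage.label(road_mask)
sizes = ndimage.sum(road_mask, comp, index=range(1, n + 1))
keep = [i + 1 for i, s in enumerate(sizes) if s >= 50]
filtered = np.isin(comp, keep)
print(filtered.sum(), "road pixels kept")
```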

17.
The kernel minimum squared error (KMSE) expresses the feature extractor as a linear combination of all the training samples in the high-dimensional kernel space. To extract a feature from a sample, KMSE should calculate as many kernel functions as the training samples. Thus, the computational efficiency of the KMSE-based feature extraction procedure is inversely proportional to the size of the training sample set. In this paper, we propose an efficient kernel minimum squared error (EKMSE) model for two-class classification. The proposed EKMSE expresses each feature extractor as a linear combination of nodes, which are a small portion of the training samples. To extract a feature from a sample, EKMSE only needs to calculate as many kernel functions as the nodes. As the nodes are commonly much fewer than the training samples, EKMSE is much faster than KMSE in feature extraction. The EKMSE can achieve the same training accuracy as the standard KMSE. Also, EKMSE avoids the overfitting problem. We implement the EKMSE model using two algorithms. Experimental results show the feasibility of the EKMSE model.
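A minimal numpy sketch contrasting a KMSE-style extractor (one kernel coefficient per training sample) with an EKMSE-style extractor expressed over a small set of nodes; picking the nodes as a random subset is an assumption for illustration, not the paper's node-selection algorithm.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))        # two-class targets in {-1, +1}

# KMSE-style: least-squares coefficients over all 200 training samples.
alpha, *_ = np.linalg.lstsq(rbf(X, X), y, rcond=None)

# EKMSE-style: express the extractor over 20 "nodes" only (random subset here).
nodes = X[rng.choice(len(X), size=20, replace=False)]
beta, *_ = np.linalg.lstsq(rbf(X, nodes), y, rcond=None)

x_new = rng.normal(size=(1, 5))
print(rbf(x_new, X) @ alpha)      # needs 200 kernel evaluations per sample
print(rbf(x_new, nodes) @ beta)   # needs only 20 kernel evaluations per sample
```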

18.
This paper surveys the current state of research on reverse engineering of application user interfaces and briefly explains the two main classes of methods, static and dynamic. Interface information extraction experiments are carried out on a simple system with both kinds of methods, and the accuracy and completeness of the extracted information are compared. The results show that dynamic methods extract interface information more accurately. Key topics and directions for future research on user interface reverse engineering are then discussed.

19.
Image collections are currently widely available and are being generated at a fast pace due to mobile and accessible equipment. In principle, that is a good scenario for the design of successful visual pattern recognition systems. However, in particular for classification tasks, one may need to choose which examples are more relevant in order to build a training set that well represents the data, since classifiers often require representative and sufficient observations to be accurate. In this paper we investigated three methods for selecting relevant examples from image collections based on learning models from small portions of the available data. We considered supervised methods that need labels to allow selection, and an unsupervised method that is agnostic to labels. The image datasets studied were described using both handcrafted and deep learning features. A general purpose algorithm is proposed which uses learning methods as subroutines. We show that our relevance selection algorithm outperforms random selection, in particular when using unlabelled data in an unsupervised approach, significantly reducing the size of the training set with little decrease in the test accuracy.
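A minimal sketch of label-agnostic selection of representative training examples via clustering; the medoid-style k-means selection below is a stand-in for the unsupervised method described in the abstract, and the feature dimensions are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(features, n_select=50, random_state=0):
    """Cluster the collection and keep the example closest to each centroid,
    so the reduced training set still covers the data distribution."""
    km = KMeans(n_clusters=n_select, n_init=10, random_state=random_state).fit(features)
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(chosen)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 512))        # e.g. deep features of an image collection
subset = select_representatives(feats, n_select=50)
print(subset.shape)                         # (50,) indices of the selected images
```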

20.
The ‘curse of dimensionality’ is a drawback for the classification of hyperspectral images. Band extraction is a technique for reducing the dimensionality and makes classification computationally less complex. In this article, an unsupervised band extraction method for hyperspectral images is proposed. In the proposed method, kernel principal component analysis (KPCA) is used to transform the original data; it integrates nonlinear characteristics as well as the advantages of principal component analysis and extracts higher-order statistics of the data. KPCA is highly dependent on the number of patterns used to calculate the kernel matrix. So, a proper selection of a subset of patterns that represents the original data well may reduce the computational cost of the proposed method with considerably better performance. Here, a density-based spatial clustering technique is first used to group the pixels according to their similarity, and then some percentage of pixels from each cluster is selected to form the subset of patterns. To demonstrate the effectiveness of the proposed clustering- and KPCA-based unsupervised band extraction method, investigation is carried out on three hyperspectral data sets, namely Indian, KSC, and Botswana. Four evaluation measures, namely classification accuracy, kappa coefficient, class separability, and entropy, are calculated over the extracted bands to measure the efficiency of the proposed method. The performance of the proposed method is compared with four state-of-the-art unsupervised band extraction approaches, both qualitatively and quantitatively, and shows promising results in terms of the four evaluation measures.
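A minimal scikit-learn sketch of the pipeline described above (cluster pixels, sample a fraction per cluster, fit KPCA on the subset, transform all pixels); the synthetic data, DBSCAN parameters, sampling rate, and number of components are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
pixels = rng.random((3000, 100))                 # rows = pixels, cols = spectral bands

# 1) Group pixels by spectral similarity with density-based clustering.
labels = DBSCAN(eps=1.2, min_samples=10).fit_predict(pixels)

# 2) Sample a percentage of pixels from every cluster (noise label -1 included here).
subset_idx = []
for lab in np.unique(labels):
    members = np.where(labels == lab)[0]
    n_take = max(1, int(0.05 * len(members)))
    subset_idx.extend(rng.choice(members, size=n_take, replace=False))
subset = pixels[np.array(subset_idx)]

# 3) Fit KPCA on the small subset only, then project all pixels onto the
#    extracted (nonlinear) components, which serve as the reduced "bands".
kpca = KernelPCA(n_components=15, kernel="rbf", gamma=0.1).fit(subset)
reduced = kpca.transform(pixels)
print(reduced.shape)                             # (3000, 15)
```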
