Found 20 similar documents (search time: 22 ms)
1.
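This paper proposes a method for automatically extracting metadata from documents such as scientific and technical literature. It combines automatic induction with a feature-similarity algorithm: an inductive learning algorithm based on feature similarity generates the extraction rules automatically, and the rules are then used to extract metadata from documents. The method uses certain attributes specific to each document to partition its content into blocks, generates extraction rules by induction, matches the generated rules using feature similarity, and then extracts the document's metadata. This improves both the efficiency of automatic rule generation and the accuracy of metadata extraction.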
2.
3.
4.
5.
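To meet the pressing demand for building information in urban development planning, the updating of national geographic conditions information systems, digital cities, and military reconnaissance, this paper applies the Semi-supervised Discriminant Analysis (SDA) algorithm to built-up area extraction from high-resolution SAR imagery, aiming to extract built-up area information quickly and to improve the recognition of urban ground objects. Using Radarsat-2 and TerraSAR-X images as experimental data, various texture features are computed from the gray-level co-occurrence matrix; SDA is then used for feature extraction, and the new features are fed to Otsu's method to extract built-up areas; finally, the classification results are post-processed. The results are compared with those of Linear Discriminant Analysis (LDA) and Locality Preserving Projection (LPP). They show that SDA generalizes well and that, when little prior class information is available, it is suitable for feature extraction from high-resolution SAR imagery and can extract built-up area information quickly and effectively.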
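The thresholding step in the abstract above can be sketched as follows. This is a minimal pure-Python version of Otsu's method applied to a one-dimensional list of feature values; it does not implement SDA or the texture features, and the function name `otsu_threshold`, the `bins` parameter, and the sample values are assumptions made for illustration only.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold that maximizes the between-class variance.

    `values` is a non-empty list of scalar feature values (hypothetical
    input; in the abstract these would be the SDA-projected features).
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # all values identical: nothing to separate

    # Build a histogram over [lo, hi] with `bins` equal-width bins.
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0  # weight and weighted bin-index sum of the lower class
    for i, h in enumerate(hist):
        w0 += h
        sum0 += i * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_var, best_bin = var, i

    # Values below the upper edge of the best bin form the lower class.
    return lo + (best_bin + 1) * width


features = [0.1, 0.2, 0.15, 0.9, 0.95, 0.85]
t = otsu_threshold(features)
built_up = [v >= t for v in features]
```

Here `built_up` marks the samples that fall in the upper class; which class corresponds to built-up areas depends on the feature and is not specified in the abstract.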
6.
The World Wide Web is becoming the largest information resource, which makes information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
7.
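Implementation of an Information Extraction System Based on Rule Induction  Total citations: 2 (self-citations: 0, citations by others: 2)
With the rapid growth of Web information, information extraction technology is well suited to extracting the required factual data from large numbers of documents. Through Document Object Model (DOM) parsing and the definition of retrieval, extraction, and mapping rules, an information extraction system with rule-induction capability was designed and implemented for the automatic retrieval of Web information. Within the framework used for inducing extraction rules, the WHISK learning algorithm for generating extraction patterns was also evaluated in comparative experiments. The results show that the system has good inductive learning ability for both single-slot and multi-slot data.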
8.
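Extracting market quotation data from Web pages for forecasting and analysis is of considerable value. This paper proposes an algorithm for extracting market data from the Web, based mainly on empirical regularities such as "market data usually appears on a Web page as the data table occupying the largest area". The algorithm first automatically identifies the largest data table, then converts it into a DOM tree, and finally extracts the node values of that tree. Unlike traditional algorithms, it locates the market data region automatically, without requiring the user to define the region to extract. A prototype system for forecasting agricultural product prices was also built: for a given product, it automatically obtains price data from specific Web sites and forecasts monthly prices. Experiments show good forecasting performance.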
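The "largest table" heuristic described above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the paper's algorithm: table size is approximated here by cell count rather than rendered area, and the names `TableCollector` and `largest_table` as well as the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._stack = []   # tables currently open (handles nesting)
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._stack and self._stack[-1]:
                self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())
            self._cell = None  # drop any cell left unclosed in this table


def largest_table(html):
    """Return the table with the most cells, or [] if the page has none."""
    parser = TableCollector()
    parser.feed(html)
    return max(parser.tables, key=lambda t: sum(len(r) for r in t), default=[])


page = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th>date</th><th>price</th></tr>
  <tr><td>2020-01</td><td>3.5</td></tr>
  <tr><td>2020-02</td><td>3.7</td></tr>
</table>
"""
rows = largest_table(page)
```

In this sketch the small navigation table is ignored and `rows` holds the header row and the two price rows of the second table.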
9.
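Automatic Building Extraction Based on an Improved Straight-Line Snake Algorithm  Total citations: 1 (self-citations: 0, citations by others: 1)
To study automatic and semi-automatic extraction of buildings from aerial imagery, the straight-line Snake algorithm is analyzed and modified: an average connectivity distance is added to its internal energy function, the second-order term is modified, the external energy function is normalized, and an external force is added. The improved straight-line Snake algorithm is then combined with a greedy algorithm to extract buildings. The method extracts buildings automatically and correctly, and experimental results show that the new algorithm improves extraction efficiency.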
10.
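Objective: Gestalt psychology holds that a whole is recognized through the perception of its parts. This paper applies that idea to building extraction and proposes a method for extracting buildings from high-resolution remote sensing imagery that accounts for both the details of the target and its overall geometric features. Method: First, the SIFT algorithm extracts feature points as candidate edge points; next, a Gestalt sequence-continuity principle is defined to identify edge points, yielding an edge point set; edges are then fitted from the edge point set to extract buildings from the imagery. Results: The proposed algorithm was tested on WorldView-2 remote sensing imagery. Comparison with a building extraction algorithm based on multi-scale segmentation and region merging shows that the proposed algorithm extracts buildings more accurately and completely. In a quantitative evaluation using four indicators (branching factor, miss factor, detection rate, and completeness), the detection rate and completeness of the proposed algorithm exceed those of the comparison algorithm, and its detection rate is above 95% in all cases, which confirms the effectiveness and accuracy of the proposed Gestalt-based building extraction algorithm. Conclusion: The Gestalt-based algorithm accurately captures building details while preserving the overall geometric outline of buildings, and extracts buildings from high-resolution remote sensing imagery accurately. The algorithm targets high-resolution remote sensing imagery and is suited to buildings whose edges have straight-line features. Extraction accuracy decreases as resolution decreases, so imagery with a resolution of 5 m or finer is recommended.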
11.
Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance with minimal user input compared to fully supervised techniques.
12.
13.
Feature extraction is an important component of a pattern recognition system. It performs two tasks: transforming the input parameter vector into a feature vector and/or reducing its dimensionality. A well-defined feature extraction algorithm makes the classification process more effective and efficient. Two popular methods for feature extraction are linear discriminant analysis (LDA) and principal component analysis (PCA). In this paper, the minimum classification error (MCE) training algorithm, which was originally proposed for optimizing classifiers, is investigated for feature extraction. A generalized MCE (GMCE) training algorithm is proposed to remedy the shortcomings of the MCE training algorithm. The LDA, PCA, MCE, and GMCE algorithms extract features through linear transformation. The support vector machine (SVM) is a recently developed pattern classification algorithm that uses non-linear kernel functions to achieve non-linear decision boundaries in the parametric space. In this paper, SVM is also investigated and compared to the linear feature extraction algorithms.
14.
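An Automatic Web Page Data Extraction System  Total citations: 6 (self-citations: 0, citations by others: 6)
The Internet contains a large number of semi-structured HTML pages. To make use of this rich Web page data, the data must be extracted from the pages. This paper introduces a new tree-structure-based information extraction method and DAE (DOM-based Automatic Extraction), a system that generates wrappers automatically and converts HTML page data into XML data. The extraction process requires essentially no human intervention and is therefore automated. The method can be applied in information search agents or in data integration systems.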
15.
L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises  Total citations: 3 (self-citations: 1, citations by others: 2)
In this paper, a new method named L-tree match is presented for extracting data from complex data sources. Firstly, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Secondly, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Thirdly, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Lastly, this method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.
16.
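To address the considerable loss of detail and the insufficient consideration of road surroundings when deep neural networks extract roads from Gaofen-2 (GF-2) remote sensing imagery, this paper builds on existing research and proposes an improved road extraction scheme based on a fully convolutional network. The scheme examines the principles of fully convolutional networks, tiles the color-preadjusted GF-2 imagery at a fixed size, and feeds the resulting image tiles together with their labels into an improved network built on a fully convolutional network. By incorporating residual units and increasing the number of network layers, it produces road extraction images with higher recognition accuracy. Experiments show that, on the same samples, the method improves road extraction from GF-2 satellite imagery, with better road completeness and accuracy.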
17.
This work proposes a novel adaptive approach for character segmentation and feature vector extraction from seriously degraded images. A histogram-based algorithm automatically detects fragments and merges them before the fragmented characters are segmented. A morphological thickening algorithm automatically locates reference lines for separating overlapped characters. A morphological thinning algorithm and a segmentation cost calculation automatically determine the baseline for segmenting connected characters. In essence, our approach can detect fragmented, overlapped, or connected characters and adaptively apply one of the three algorithms without manual fine-tuning. Seriously degraded images, such as real-world license plate images, are used in the experiments to evaluate the robustness, flexibility, and effectiveness of our approach. The feature vectors output by the approach retain useful information more accurately and can serve as input to an automatic pattern recognition system.
18.
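Building Extraction from High-Resolution Remote Sensing Images Based on Invariant Moments  Total citations: 1 (self-citations: 0, citations by others: 1)
To extract image features effectively, invariant moment algorithms are applied to urban building areas in two kinds of high-resolution remote sensing images, IKONOS and WorldView. The image data first undergo Canny edge detection and marker-based watershed segmentation; on that basis, Hu invariant moments and affine invariant moments are each used for feature extraction. Evaluation of the experimental results shows that affine invariant moments are markedly more effective than Hu invariant moments for extracting building features, and that extracting building features from high-resolution remote sensing images with invariant moment algorithms is feasible and effective.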
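As an illustration of the Hu invariant moments mentioned above, the following sketch computes the first two Hu moments of a small binary region in pure Python and checks that they do not change under translation or a 90-degree rotation. It covers neither the affine invariant moments nor the Canny and watershed preprocessing from the abstract; the function name `hu_first_two` and the toy shape are assumptions for illustration.

```python
def hu_first_two(image):
    """Return (phi1, phi2), the first two Hu invariant moments.

    `image` is a list of rows of non-negative pixel values
    (hypothetical input: 1 inside the region, 0 outside).
    """
    # Raw moments of order 0 and 1 give the region's mass and centroid.
    m00 = m10 = m01 = 0.0
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            m00 += v
            m10 += x * v
            m01 += y * v
    if m00 == 0:
        raise ValueError("image contains no foreground pixels")
    cx, cy = m10 / m00, m01 / m00

    # Second-order central moments (translation invariant).
    mu20 = mu02 = mu11 = 0.0
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            dx, dy = x - cx, y - cy
            mu20 += dx * dx * v
            mu02 += dy * dy * v
            mu11 += dx * dy * v

    # Normalized central moments: eta_pq = mu_pq / m00 ** (1 + (p + q) / 2),
    # and the exponent equals 2 for all second-order moments.
    norm = m00 ** 2
    eta20, eta02, eta11 = mu20 / norm, mu02 / norm, mu11 / norm

    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return phi1, phi2


shape = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
rotated = [list(r) for r in zip(*shape[::-1])]       # rotate 90 degrees clockwise
shifted = [[0] * 5] + [[0] + row for row in shape]   # translate by one pixel

base = hu_first_two(shape)
same_after_rotation = all(abs(a - b) < 1e-9 for a, b in zip(base, hu_first_two(rotated)))
same_after_shift = all(abs(a - b) < 1e-9 for a, b in zip(base, hu_first_two(shifted)))
```

Both flags are expected to be true for this toy shape, since a 90-degree rotation only swaps the two second-order axis moments and flips the sign of the mixed moment.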
19.
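Ontology Concept Extraction Based on Web Data  Total citations: 1 (self-citations: 0, citations by others: 1)
Ontologies are increasingly important in knowledge management and the Semantic Web, but building an ontology often takes a great deal of time, and maintaining it afterwards is also time-consuming for knowledge managers. Automatically creating domain ontologies can overcome the shortcomings of manual methods and has become an active research topic. Concepts are among the most important components of an ontology, and the efficiency and accuracy with which concepts are automatically extracted from semi-structured Web documents directly determine the quality of the automatically built ontology. This paper proposes an automatic ontology concept extraction model that does not depend on a domain dictionary or a core ontology and that can quickly and effectively build and update domain ontology concepts automatically by mining Chinese Web text.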
20.
Personal information extraction, which extracts the persons in question and their related information (such as biographical information and occupation) from the Web, is an important component in constructing social networks (a kind of semantic web). For this practical task, two important issues are discussed: personal named entity ambiguity and the extraction of personal information for a specific person. For personal named entity ambiguity, a common phenomenon in fast-growing Web resources, we propose a robust system that extracts lightweight features from broad resources with a totally unsupervised approach. The experiments show that these lightweight features not only improve performance but also increase the robustness of a disambiguation system. To extract the information of the focus person, an integrated system is introduced that can effectively reuse and combine current well-developed tools for Web data while identifying the expression properties of Web data. We show that our flexible extraction system achieves state-of-the-art performance, especially high precision, which is very important for real applications.