Similar Literature
20 similar records found.
1.
The Semantic Web envisions a World Wide Web in which data is described with rich semantics and applications can pose complex queries. To this point, researchers have defined new languages for specifying meanings for concepts and developed techniques for reasoning about them, using RDF as the data model. To flourish, the Semantic Web needs to provide interoperability, both between sites with different terminologies and with existing data and the applications operating on them. To achieve this, we are faced with two problems. First, most of the world's data is available not in RDF but in XML; XML and the applications consuming it rely not only on the domain structure of the data, but also on its document structure. Hence, to provide interoperability between such sources, we must map between both their domain structures and their document structures. Second, data management practitioners often prefer to exchange data through local point-to-point data translations, rather than mapping to common mediated schemas or ontologies. This paper describes the Piazza system, which addresses these challenges. Piazza offers a language for mediating between data sources on the Semantic Web, and it maps both the domain structure and document structure. Piazza also enables interoperation of XML data with RDF data that is accompanied by rich OWL ontologies. Mappings in Piazza are provided at a local scale between small sets of nodes, and our query answering algorithm is able to chain sets of mappings together to obtain relevant data from across the Piazza network. We also describe an implemented scenario in Piazza and the lessons we learned from it.
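A minimal sketch of the chaining idea, not Piazza's actual algorithm: point-to-point mappings between peers are modeled as record-rewriting functions, and a request at one node composes the mappings along a path through the network. The node names and field renamings are hypothetical.

```python
from collections import deque

# Hypothetical point-to-point mappings between peer schemas: each edge carries
# a function that rewrites a record from the source node's schema to the target's.
MAPPINGS = {
    ("library_a", "library_b"): lambda r: {"title": r["book_title"], "year": r["pub_year"]},
    ("library_b", "portal"):    lambda r: {"label": r["title"], "published": r["year"]},
}

def find_mapping_path(source, target):
    """Breadth-first search for a chain of mappings from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for (src, dst), fn in MAPPINGS.items():
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [fn]))
    return None

def translate(record, source, target):
    """Apply the composed mappings along the discovered path."""
    path = find_mapping_path(source, target)
    if path is None:
        raise ValueError(f"no mapping chain from {source} to {target}")
    for fn in path:
        record = fn(record)
    return record

print(translate({"book_title": "Semantic Web", "pub_year": 2004}, "library_a", "portal"))
# -> {'label': 'Semantic Web', 'published': 2004}
```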

2.
A Survey of Web Information Extraction Systems
The Internet presents a huge amount of useful information that is usually formatted for human users, which makes it difficult to extract relevant data from diverse sources. The availability of robust, flexible Information Extraction (IE) systems that transform Web pages into program-friendly structures such as a relational database therefore becomes a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared, since the extraction tasks they address are different. This paper surveys the major Web data extraction approaches and compares them along three dimensions: the task domain, the techniques used, and the degree of automation. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation of IE systems. We believe these criteria provide qualitative measures for evaluating various IE approaches.

3.
An effective solution for automating information extraction from Web pages is represented by wrappers. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have traditionally focused on how to generate extraction rules, while ignoring the potential benefits of using the schema of the extracted information in wrapper evaluation. In this paper, we investigate how the schema of the extracted information can be effectively used in both the design and the evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential to compute a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system for the proposed schema-based wrapping approach, which also provides visual support tools to the wrapper designer. Moreover, we present a wrapper generalization framework to profitably speed up the design of schema-based wrappers. Experimental evaluation has shown that SCRAP wrappers are not only able to successfully extract the required data, but are also robust to changes that may occur in the source Web pages.
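A toy, hedged illustration of the schema-guided wrapping idea, not the SCRAP system itself: a target schema lists the fields an extracted record must provide, extraction rules are plain regular expressions, and a record is emitted as XML only if it satisfies the schema. The page snippet, field names, and rules are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical target schema and extraction rules for a product listing page.
SCHEMA = {"name": True, "price": True, "rating": False}   # field -> required?
RULES = {
    "name":   re.compile(r'<h2 class="prod">(.*?)</h2>'),
    "price":  re.compile(r'<span class="price">\$([\d.]+)</span>'),
    "rating": re.compile(r'<span class="stars">([\d.]+)</span>'),
}

def wrap(page_html):
    """Extract one record, validate it against the schema, and return XML (or None)."""
    record = {}
    for field, rule in RULES.items():
        m = rule.search(page_html)
        if m:
            record[field] = m.group(1)
    # Schema validation: every required field must have been extracted.
    if any(required and field not in record for field, required in SCHEMA.items()):
        return None
    root = ET.Element("product")
    for field, value in record.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")

page = '<h2 class="prod">USB cable</h2> <span class="price">$4.99</span>'
print(wrap(page))   # <product><name>USB cable</name><price>4.99</price></product>
```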

4.
Named entity recognition (NER) is the core part of information extraction that facilitates the automatic detection and classification of entities in natural language text into predefined categories, such as the names of persons, organizations, locations, and so on. The output of the NER task is crucial for many applications, including relation extraction, textual entailment, machine translation, information retrieval, etc. Literature shows that machine learning and deep learning approaches are the most widely used techniques for NER. However, for entity extraction, the abovementioned approaches demand the availability of a domain‐specific annotated data set. Our goal is to develop a hybrid NER system composed of rule‐based deep learning as well as clustering‐based approaches, which facilitates the extraction of generic entities (such as person, location, and organization) out of natural language texts of domains that lack generic named entities labeled domain data sets. The proposed approach takes the advantages of both deep learning and clustering approaches but separately, in combination with a knowledge‐based approach by using a postprocessing module. We evaluated the proposed methodology on court cases (judgments) as a use case since it contains generic named entities of different forms that are poorly or not present in open‐source NER data sets. We also evaluated our hybrid models on two benchmark data sets, namely, Computational Natural Language Learning (CoNLL) 2003 and Open Knowledge Extraction (OKE) 2016. The experimental results obtained from benchmark data sets show that our hybrid models achieved substantially better performance in terms of the F‐score in comparison to other competitive systems.  相似文献   
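A heavily simplified, hedged sketch of combining a rule-based pass with a gazetteer lookup in a postprocessing step, in the spirit of a hybrid NER pipeline; it does not reproduce the paper's deep learning or clustering components, and all names and patterns are illustrative.

```python
import re

# Hypothetical gazetteers standing in for knowledge-based lookup.
GAZETTEER = {
    "PERSON":       {"John Smith", "Jane Doe"},
    "ORGANIZATION": {"Supreme Court", "Acme Corp"},
    "LOCATION":     {"New York", "Lahore"},
}
# A crude surface rule: capitalized token sequences are candidate entities.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b")

def rule_based_pass(text):
    """Return candidate spans found by the surface rule, untyped."""
    return [(m.group(), m.start(), m.end()) for m in CANDIDATE.finditer(text)]

def knowledge_based_postprocess(candidates):
    """Assign a type via gazetteer lookup; drop candidates we cannot type."""
    entities = []
    for surface, start, end in candidates:
        for label, names in GAZETTEER.items():
            if surface in names:
                entities.append({"text": surface, "type": label, "span": (start, end)})
    return entities

text = "John Smith appeared before the Supreme Court in New York."
print(knowledge_based_postprocess(rule_based_pass(text)))
```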

5.
To address the problems that the methods in existing information extraction systems are not reusable and cannot extract semantic information, a topic-oriented Web information extraction framework based on domain ontology is proposed. For Chinese Web pages, the ontology, together with external resources, is used to interpret information; the source-document handling, information collection, document preprocessing, and document storage techniques of the collection and preprocessing stage are analyzed and designed; word segmentation, lexicon lookup, and named entity recognition algorithms for text conversion are proposed; and a knowledge extraction scheme is given. Experimental results show that the method achieves extraction results with relatively high performance.
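The abstract mentions lexicon-lookup word segmentation as part of text conversion; the sketch below shows a generic forward maximum matching segmenter, a standard baseline for that step, under an assumed toy lexicon. It is not the paper's algorithm.

```python
# Forward maximum matching segmentation against a toy lexicon (illustrative only).
LEXICON = {"信息", "抽取", "信息抽取", "领域", "本体", "系统"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def forward_max_match(sentence):
    """Greedily take the longest lexicon word at each position; fall back to one character."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in LEXICON or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(forward_max_match("领域本体信息抽取系统"))
# -> ['领域', '本体', '信息抽取', '系统']
```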

6.
We address the visual categorization problem and present a method that utilizes weakly labeled data from other visual domains as the auxiliary source data for enhancing the original learning system. The proposed method aims to expand the intra-class diversity of original training data through the collaboration with the source data. In order to bring the original target domain data and the auxiliary source domain data into the same feature space, we introduce a weakly-supervised cross-domain dictionary learning method, which learns a reconstructive, discriminative and domain-adaptive dictionary pair and the corresponding classifier parameters without using any prior information. Such a method operates at a high level, and it can be applied to different cross-domain applications. To build up the auxiliary domain data, we manually collect images from Web pages, and select human actions of specific categories from a different dataset. The proposed method is evaluated for human action recognition, image classification and event recognition tasks on the UCF YouTube dataset, the Caltech101/256 datasets and the Kodak dataset, respectively, achieving outstanding results.
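A hedged illustration of only the shared-dictionary building block behind such approaches: source and target features are pooled to learn one sparse dictionary, the sparse codes become a common representation, and a classifier is trained on the coded target data. This uses scikit-learn's generic DictionaryLearning, not the paper's weakly supervised, domain-adaptive formulation; all sizes and data are synthetic.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: target-domain features (labeled) and auxiliary source features.
X_target = rng.normal(size=(60, 40))
y_target = rng.integers(0, 2, size=60)
X_source = rng.normal(size=(80, 40))

# Learn one dictionary over the pooled source + target data so both domains
# are expressed in the same sparse-code feature space.
dico = DictionaryLearning(n_components=16, transform_algorithm="lasso_lars", random_state=0)
dico.fit(np.vstack([X_target, X_source]))

codes_target = dico.transform(X_target)          # shared representation of target data
clf = LogisticRegression(max_iter=1000).fit(codes_target, y_target)
print("training accuracy:", clf.score(codes_target, y_target))
```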

7.
Learning Image-Text Associations
Web information fusion can be defined as the problem of collating and tracking information related to specific topics on the World Wide Web. Whereas most existing work on Web information fusion has focused on text-based multidocument summarization, this paper concerns the topic of image and text association, a cornerstone of cross-media Web information fusion. Specifically, we present two learning methods for discovering the underlying associations between images and texts based on small training data sets. The first method, based on vague transformation, measures the information similarity between the visual features and the textual features through a set of predefined domain-specific information categories. The second method uses a neural network to learn a direct mapping between the visual and textual features by automatically and incrementally summarizing the associated features into a set of information templates. Despite their distinct approaches, our experimental results on a terrorist-domain document set show that both methods are capable of learning associations between images and texts from a small training data set.

8.
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second-order logic (MSO) on trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial-time algorithm in RPNI style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs, a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987). We have implemented our interactive learning algorithm and integrated it into a visually interactive Web information extraction system, called SQUIRREL, by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.

9.
Taylor, S.M. IT Professional, 2004, 6(6): 28-34
Most readily available tools - basic search engines, possibly a news or information service, and perhaps agents and Web crawlers - are inadequate for many information retrieval tasks and downright dangerous for others. These tools either return too much useless material or miss important material. Even when such tools find useful information, the data is still in a text form that makes it difficult to build displays or diagrams. Employing the data in data mining or standard database operations, such as sorting and counting, can also be difficult. An emerging technology called information extraction (IE) is beginning to change all that, and you might already be using some very basic IE tools without even knowing it. Companies are increasingly applying IE behind the scenes to improve information and knowledge management applications such as text search, text categorization, data mining, and visualization (Rao, 2003). IE has also begun playing a key role in fields such as national security, law enforcement, insurance, and biomedical research, which have highly critical information and knowledge needs. In these fields, IE's powerful capabilities are necessary to save lives or substantial investments of time and money. IE views language up close, considering grammar and vocabulary, and tries to determine the details of "who did what to whom" from a piece of text. In its most in-depth applications, IE is domain focused; it does not try to define all the events or relationships present in a piece of text, but focuses only on items of particular interest to the user organization.

10.
Because Web page information is heterogeneous and dynamic, most existing Web information extraction methods suffer from poor applicability. To address this, a Web information extraction method based on multiple learning strategies is proposed, which combines a traditional text classifier with a hidden Markov learning strategy. On the basis of the locally optimal classification and extraction results obtained for Web text records, the method further optimizes the extraction results using the structural information of the whole Web page text. Experimental results show that the method achieves high information recall and extraction precision without retraining on new sites, demonstrating strong applicability.
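A generic Viterbi decoder for a small hidden Markov model, included only to illustrate the HMM building block that such a strategy relies on; the states, observation symbols, and probabilities below are made up and are not the paper's model.

```python
import numpy as np

# Toy HMM: hidden states might label page-text tokens as record fields vs. noise.
states = ["FIELD", "NOISE"]
start_p = np.log([0.6, 0.4])
trans_p = np.log([[0.7, 0.3],     # FIELD -> FIELD/NOISE
                  [0.4, 0.6]])    # NOISE -> FIELD/NOISE
emit_p  = np.log([[0.5, 0.4, 0.1],   # FIELD emits: value, label, boilerplate
                  [0.1, 0.2, 0.7]])  # NOISE emits: value, label, boilerplate
obs = [0, 1, 2, 2]                   # an observed token sequence (by symbol index)

def viterbi(obs):
    """Return the most probable hidden state sequence for the observations."""
    n, T = len(states), len(obs)
    dp = np.full((T, n), -np.inf)
    back = np.zeros((T, n), dtype=int)
    dp[0] = start_p + emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(n):
            scores = dp[t - 1] + trans_p[:, s]
            back[t, s] = np.argmax(scores)
            dp[t, s] = scores[back[t, s]] + emit_p[s, obs[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(obs))   # e.g. ['FIELD', 'FIELD', 'NOISE', 'NOISE']
```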

11.
12.
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that the eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naïve Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy, and in addition, achieve a significant reduction in the classification time.
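A small, hedged sketch of one way a word-order graph can feed an eager (model-based) classifier: build a directed graph of adjacent terms per document, then encode the presence of the most frequent edges as a boolean vector for a Naïve Bayes classifier. This illustrates the hybrid graph-plus-vector idea in general, not the specific methods of the article; all documents and labels are toy data.

```python
from collections import Counter
from sklearn.naive_bayes import BernoulliNB

def word_graph_edges(text):
    """Directed edges between adjacent lowercased terms (a minimal word-order graph)."""
    terms = text.lower().split()
    return set(zip(terms, terms[1:]))

docs = ["cheap flights to new york", "new york hotel deals",
        "python machine learning tutorial", "machine learning with python"]
labels = [0, 0, 1, 1]   # 0 = travel, 1 = programming (toy labels)

# Vocabulary of edges = the most common adjacency pairs across the training set.
edge_counts = Counter(e for d in docs for e in word_graph_edges(d))
vocab = [e for e, _ in edge_counts.most_common(20)]

def to_vector(text):
    edges = word_graph_edges(text)
    return [1 if e in edges else 0 for e in vocab]

X = [to_vector(d) for d in docs]
clf = BernoulliNB().fit(X, labels)
print(clf.predict([to_vector("machine learning in python")]))   # likely [1]
```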

13.
A Chinese word segmentation system trained on newswire-domain annotated corpora degrades markedly when applied across domains. To address the difficulty of obtaining large-scale annotated corpora for a target domain, this paper proposes a domain adaptation method that combines active learning with n-gram statistical features. The method statistically analyzes the differences between target-domain text and the existing annotated corpus, selects for manual annotation a small corpus containing the largest number of previously unannotated linguistic phenomena, and then trains the target-domain segmentation system together with n-gram statistical features drawn from large-scale raw text. A CRF model is used for training, and the effectiveness of the method is verified on a one-million-sentence scientific-literature domain, with 300 manually annotated scientific-literature sentences as the evaluation data. Experimental results show that, on the scientific-literature test corpus, the segmentation system trained with active learning improves on all evaluation metrics.
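A hedged sketch of the selection step described in the abstract: score each unlabeled target-domain sentence by how many of its character n-grams are unseen in the already-annotated corpus, and hand the highest-scoring sentences to the annotator first. The n-gram order, scoring, and data are illustrative simplifications.

```python
def char_ngrams(sentence, n=2):
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}

def select_for_annotation(unlabeled, labeled, k=2, n=2):
    """Rank unlabeled sentences by the number of n-grams absent from the labeled corpus."""
    seen = set()
    for s in labeled:
        seen |= char_ngrams(s, n)
    scored = [(len(char_ngrams(s, n) - seen), s) for s in unlabeled]
    scored.sort(reverse=True)
    return [s for score, s in scored[:k]]

labeled_corpus = ["新华社今日报道", "股市今日收盘上涨"]                      # existing newswire annotations
unlabeled_target = ["卷积神经网络的训练方法", "今日报道", "基因表达谱数据分析"]  # scientific-domain text
print(select_for_annotation(unlabeled_target, labeled_corpus))
```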

14.
Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable and richly structured semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which is characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which is typically performed using page-specific, syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain. This work was conducted while the author was at Stony Brook University.
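A hedged, minimal sketch of the general pattern the abstract describes: start from a small seed of hand-labeled content fragments, fit a lightweight statistical model over simple content features, and use it to annotate fragments from other pages of the same semantic domain. The features, labels, and fragments are invented; this is not the paper's model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Seed set: hand-labeled fragments from a few pages (hypothetical news-site domain).
seed_fragments = ["By Jane Doe, Staff Writer", "Published March 3, 2004",
                  "Officials announced the new policy on Tuesday.",
                  "John Smith reports from London", "Updated 10:32 AM ET",
                  "The committee will vote on the proposal next week."]
seed_concepts  = ["byline", "timestamp", "body", "byline", "timestamp", "body"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(seed_fragments, seed_concepts)

# Annotate fragments scraped from a previously unseen page of the same domain.
new_fragments = ["By Alice Brown", "Posted June 1, 2005", "Lawmakers debated the bill."]
print(list(zip(new_fragments, model.predict(new_fragments))))
```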

15.
This paper describes our research into a query-by-semantics approach to searching the World Wide Web. This research extends existing work, which had focused on a query-by-structure approach for the Web. We present a system that allows users not only to request documents containing specific content information, but also to specify that documents be of a certain type. The system captures and utilizes structure information as well as content during a distributed query of the Web. The system also allows users the option of creating their own document types by providing the system with example documents. In addition, although the system still gives users the option of dynamically querying the Web, the incorporation of a document database has improved the response time involved in the search process. Based on extensive testing and validation presented herein, it is clear that a system that incorporates structure and document semantic information into the query process can significantly improve search results over the standard keyword search.

16.
17.
Appropriately utilizing the rapidly growing amount of data and information is a big challenge for people and organizations. Standard information retrieval methods, using sequential processing combined with syntax-based indexing and access methods, have not been able to adequately handle this problem. We are currently investigating a different approach, based on a combination of massive parallel processing with case-based (memory-based) reasoning methods. Given the problems of purely syntax-based retrieval methods, we suggest ways of incorporating general domain knowledge into memory-based reasoning. Our approach is related to the properties of the parallel processing microchip MS160, particularly targeted at fast information retrieval from very large data sets. Within this framework different memory-based methods are studied, differing in the type and representation of cases, and in the way that the retrieval methods are supported by explicit general domain knowledge. Cases can be explicitly stored information retrieval episodes, virtually stored abstractions linked to document records, or merely the document records themselves. General domain knowledge can be a multi-relational semantic network, a set of term dependencies and relevances, or compiled into a global similarity metric. This paper presents the general framework, discusses the core issues involved, and describes three different methods illustrated by examples from the domain of medical diagnosis.
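A small, hedged sketch of one of the ideas mentioned: compile general domain knowledge into a global similarity metric by giving domain-relevant terms higher weights, then retrieve the stored cases most similar to a query. The medical terms, weights, and cases below are made up.

```python
# Cases are stored retrieval episodes / document records represented as term sets.
CASES = {
    "case1": {"fever", "cough", "fatigue", "x-ray"},
    "case2": {"headache", "nausea", "fatigue"},
    "case3": {"fever", "rash", "child"},
}
# General domain knowledge compiled into term relevance weights (hypothetical).
TERM_WEIGHT = {"fever": 3.0, "rash": 2.5, "cough": 2.0, "headache": 2.0,
               "nausea": 1.5, "fatigue": 1.0, "x-ray": 0.5, "child": 0.5}

def weighted_similarity(query, case):
    """Weighted overlap normalized by the total weight of the query terms."""
    shared = sum(TERM_WEIGHT.get(t, 1.0) for t in query & case)
    total = sum(TERM_WEIGHT.get(t, 1.0) for t in query)
    return shared / total if total else 0.0

def retrieve(query, k=2):
    ranked = sorted(CASES, key=lambda c: weighted_similarity(query, CASES[c]), reverse=True)
    return ranked[:k]

print(retrieve({"fever", "fatigue", "cough"}))   # case1 ranks first
```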

18.
Domain ontologies are one of the key elements of Semantic Web technologies and are important constructs for multi-agent systems. The Semantic Web relies on domain ontologies that structure the underlying data, enabling comprehensive and transportable machine understanding. Constructing domain ontologies takes considerable time and effort because they are typically built manually by domain experts and knowledge engineers. To solve these problems, there has been much research on semi-automatic ontology construction. Most of this research has focused on relation extraction while selecting terms for the ontologies manually, which leaves several problems open. In this paper, we propose a hybrid method to extract relations from domain documents that combines a named relation approach and an unnamed relation approach. Our named relation approach is based on Hearst's patterns and the Snowball system, into whose methods we merge a generalized pattern scheme. In our unnamed relation approach, we extract unnamed relations using association rules and a clustering method. Moreover, we recommend candidate relation names for unnamed relations. We evaluate the proposed method using the Ziff document set provided by TREC.
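A hedged, minimal illustration of the named-relation idea via two classic Hearst-style patterns ("Y such as X" and "X and other Y") implemented as regular expressions; the paper's approach generalizes such patterns and adds Snowball-style bootstrapping, which is not reproduced here. The example sentences are invented.

```python
import re

# Two classic Hearst-style patterns yielding (hyponym, hypernym) pairs.
PATTERNS = [
    (re.compile(r"(\w+(?: \w+)?) such as (\w+(?: \w+)?)"), lambda m: (m.group(2), m.group(1))),
    (re.compile(r"(\w+) and other (\w+(?: \w+)?)"), lambda m: (m.group(1), m.group(2))),
]

def extract_isa_relations(text):
    """Return (hyponym, hypernym) pairs matched by the patterns."""
    pairs = []
    for pattern, to_pair in PATTERNS:
        for m in pattern.finditer(text):
            pairs.append(to_pair(m))
    return pairs

text = ("We evaluated operating systems such as Linux, "
        "and the corpus mentions routers and other network devices.")
print(extract_isa_relations(text))
# -> [('Linux', 'operating systems'), ('routers', 'network devices')]
```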

19.
As the internet grows rapidly, millions of web pages are being added on a daily basis. The extraction of precise information is becoming more and more difficult as the volume of data on the internet increases. Several search engines and information fetching tools are available on the internet, all of which claim to provide the best crawling facilities. For the most part, these search engines are keyword based. This poses a problem for visually impaired people who want to make full use of the online resources available to other users. Visually impaired users require special aid to get along with any given computer system. Interface and content management are no exception, and special tools are required to facilitate the extraction of relevant information from the internet for visually impaired users. The HOIEV (Heavyweight Ontology Based Information Extraction for Visually impaired User) architecture provides a mechanism for highly precise information extraction using a heavyweight ontology and a built-in vocal command system for visually impaired internet users. Our prototype intelligent system not only integrates and communicates among different tools, such as voice command parsers, domain ontology extractors and short message engines, but also introduces an autonomous mechanism of information extraction (IE) using a heavyweight ontology. In this research we designed a domain-specific heavyweight ontology using OWL 2 (Web Ontology Language 2), and for axiom writing we used PAL (Protégé Axiom Language). We introduced a novel autonomous mechanism for IE by developing prototype software. A series of experiments was designed to test and analyze the performance of heavyweight ontologies in general, and of our information extraction prototype in particular.

20.
The Web as a global information space is developing from a Web of documents to a Web of data. This development opens new ways for addressing complex information needs. Search is no longer limited to matching keywords against documents, but instead complex information needs can be expressed in a structured way, with precise answers as results. In this paper, we present Hermes, an infrastructure for data Web search that addresses a number of challenges involved in realizing search on the data Web. To provide an end-user oriented interface, we support expressive user information needs by translating keywords into structured queries. We integrate heterogeneous Web data sources with automatically computed mappings. Schema-level mappings are exploited in constructing structured queries against the integrated schema. These structured queries are decomposed into queries against the local Web data sources, which are then processed in a distributed way. Finally, heterogeneous result sets are combined using an algorithm called map join, making use of data-level mappings. In evaluation experiments with real life data sets from the data Web, we show the practicability and scalability of the Hermes infrastructure.
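A hedged toy version of the result-combination step: two result sets from different sources use different identifiers for the same entities, and a data-level mapping table lets a mapping-aware hash join merge them. This approximates the map join idea only; the source names, identifiers, fields, and values are invented.

```python
# Result sets from two hypothetical Web data sources, keyed by source-local identifiers.
results_source_a = [{"id": "dbpedia:Berlin", "population": 3700000},
                    {"id": "dbpedia:Hamburg", "population": 1900000}]
results_source_b = [{"city": "geonames:2950159", "mayor": "Jane Doe"},
                    {"city": "geonames:2911298", "mayor": "John Roe"}]

# Data-level mapping between the two sources' identifiers (normally computed automatically).
ID_MAPPING = {"geonames:2950159": "dbpedia:Berlin",
              "geonames:2911298": "dbpedia:Hamburg"}

def map_join(left, right, left_key, right_key, mapping):
    """Hash join where right-side keys are first translated through the mapping."""
    index = {row[left_key]: row for row in left}
    joined = []
    for row in right:
        canonical = mapping.get(row[right_key])
        if canonical in index:
            joined.append({**index[canonical], **row, right_key: canonical})
    return joined

for row in map_join(results_source_a, results_source_b, "id", "city", ID_MAPPING):
    print(row)
```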
