Similar documents
 Found 20 similar documents; search time: 15 ms
1.
We show that a language of infinite binary trees is definable by a Σ2-formula of the monadic second-order logic of two successors (with no additional symbols) iff it can be accepted by a Büchi automaton. The same result has been obtained by G. Lenzi, but our proof is simpler.

2.
The field of grammatical inference (also known as grammar induction) cuts across a number of research areas, including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology and speech recognition. There is no uniform literature on the subject, and one can find many papers with original definitions or points of view. This makes research in this subject very hard, especially for a beginner or for someone who does not wish to become a specialist but only to find the most suitable ideas for their own research activity. The goal of this paper is to introduce a number of papers related to grammatical inference. Some of these papers are essential and should constitute a common background to research in the area, whereas others are specialized in particular problems or techniques but can be of great help for specific tasks.

3.
Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.

4.
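An information extraction technique based on an improved k-contextual tree automaton inference algorithm is proposed. The core idea is to convert structured (or semi-structured) documents into trees and then use an improved k-contextual tree (the KLH tree) to construct an unranked tree automaton that accepts the samples; whether information is extracted from a Web page is decided by the automaton's accepting and rejecting states. The method makes full use of the tree structure of Web documents and relies on tree automata to combine traditional information extraction methods, which take a purely structural approach, with the principles of grammatical inference, yielding the information extraction rules. Experiments show that, compared with similar extraction methods, both the sample learning time and the time required for extraction are reduced.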

5.
We develop an algebraic language theory for languages of infinite trees. We define a class of algebras called ω-hyperclones and we show that a language of infinite trees is regular if, and only if, it is recognised by a finitary path-continuous ω-hyperclone.

6.
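An Autonomous Extraction Method for Web Information   Total citations: 12, self-citations: 0, citations by others: 12
Xu Jianchao (许建潮), Hou Kun (侯锟). Computer Engineering and Applications (《计算机工程与应用》), 2005, 41(14): 185-189, 198
A method is proposed for the autonomous extraction of information from Web pages based on table structures and list structures. According to the user's information needs, it autonomously extracts information from relevant pages, reorganizes the extracted information according to the relational model, and stores it in a database. For table-structured information sources, annotating a single page is enough to acquire the extraction knowledge; through self-learning the method adapts well to dynamic changes in page content and achieves automatic extraction. For list-structured information sources, the path of an information block in the DOM hierarchy is obtained dynamically by analyzing the DOM tree structure, and the values of information objects are obtained from basic extraction knowledge about those objects. Self-learning is used to adapt to dynamic changes in page content.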

7.
A method for inferring tree automata from a sample set of trees is presented. The procedure, which is based on the concept of the k-follower of a tree with respect to the sample tree set, produces a tree automaton capable of accepting all the sample trees as well as other trees similar in structure. The behavior of the inferred tree automaton for varying values of the parameter k is also discussed. This work was supported in part by a Scientific Research Grant-In-Aid (Grant No. 57460129) from the Ministry of Education, Science and Culture, Japan.

8.
The complexity of various membership problems for tree automata on compressed trees is analyzed. Two compressed representations are considered: dags, which allow identical subtrees of a tree to be shared, and straight-line context-free tree grammars, which moreover allow identical intermediate parts of a tree to be shared. Several completeness results for the classes NL, P, and PSPACE are obtained. Finally, the complexity of the evaluation problem for (structural) XPath queries on trees that are compressed via straight-line context-free tree grammars is investigated.
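
To make the dag representation mentioned in this abstract concrete, here is a minimal Python sketch, not taken from the paper. It shares identical subtrees by hash-consing and then runs a deterministic bottom-up tree automaton once per dag node instead of once per tree node. The names `compress_to_dag` and `run_on_dag`, the tuple encoding of trees, and the Boolean-expression automaton are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a tree is encoded as (label, [children]).

def compress_to_dag(tree, table, nodes):
    """Return the dag node id for `tree`, reusing ids of identical subtrees."""
    label, children = tree
    key = (label, tuple(compress_to_dag(c, table, nodes) for c in children))
    if key not in table:
        table[key] = len(nodes)   # children were added first, so their ids are smaller
        nodes.append(key)
    return table[key]

def run_on_dag(nodes, delta):
    """Compute the automaton state of every dag node, visiting each node once."""
    states = []
    for label, child_ids in nodes:
        child_states = tuple(states[i] for i in child_ids)
        states.append(delta[(label, child_states)])
    return states

# Hypothetical automaton: states are Booleans, transitions evaluate and/or.
delta = {("0", ()): False, ("1", ()): True}
for a in (False, True):
    for b in (False, True):
        delta[("and", (a, b))] = a and b
        delta[("or", (a, b))] = a or b

# (0 or 1) and (0 or 1): seven tree nodes, but only four distinct subtrees.
sub = ("or", [("0", []), ("1", [])])
tree = ("and", [sub, sub])

table, nodes = {}, []
root = compress_to_dag(tree, table, nodes)
states = run_on_dag(nodes, delta)
accepted = states[root] in {True}   # {True} plays the role of the final states
```

The sketch covers only the deterministic case on dags; it says nothing about the nondeterministic automata or the straight-line context-free tree grammars whose complexity the abstract analyzes.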

9.
Most work on pattern mining focuses on simple data structures such as itemsets and sequences of itemsets. However, many recent applications dealing with complex data such as chemical compounds, protein structures, XML and Web log databases, and social networks require much more sophisticated data structures such as trees and graphs. In these contexts, interesting patterns involve not only frequent object values (labels) appearing in the graphs (or trees) but also frequent specific topologies found in these structures. Recently, several techniques for tree and graph mining have been proposed in the literature. In this paper, we focus on constraint-based tree pattern mining. We propose to use tree automata as a mechanism to specify user constraints over tree patterns. We present the algorithm CoBMiner, which allows user constraints specified by a tree automaton to be incorporated in the mining process. An extensive set of experiments executed over synthetic and real data (XML documents and Web usage logs) allows us to conclude that incorporating constraints during the mining process is far more effective than filtering the interesting patterns after the mining process.

10.
11.
12.
Grammatical inference – used successfully in a variety of fields such as pattern recognition, computational biology and natural language processing – is the process of automatically inferring a grammar by examining the sentences of an unknown language. Software engineering can also benefit from grammatical inference. Unlike these other fields, which use grammars as a convenient tool to model naturally occurring patterns, software engineering treats grammars as first-class objects typically created and maintained for a specific purpose by human designers. We introduce the theory of grammatical inference and review the state of the art as it relates to software engineering.

13.
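A Web Page Information Extraction Method Based on a Classification Algorithm   Total citations: 3, self-citations: 0, citations by others: 3
Many current Web information extraction techniques are based on HTML structure. Because HTML structure changes frequently, extraction templates must be updated often, and updating them requires considerable domain knowledge. This paper proposes a Web information extraction method based on a classification algorithm: the text of a Web page is grouped according to its display attributes, the page text is classified on the basis of the display attribute values, and the text of interest is obtained, which completes the extraction of information from the page. The method is simple to operate, easy to implement, and depends little on page structure.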

14.
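Research on Information Extraction Technology Based on Unranked Tree Automata   Total citations: 1, self-citations: 0, citations by others: 1
To address the shortcomings of current information extraction methods based on Web page structure, an information extraction technique based on unranked tree automata is proposed. The core idea is to convert structured (or semi-structured) documents into unranked trees, construct a sample automaton using (k,l)-contextual trees, and extract data from Web pages according to the accepting and rejecting states of the tree automaton. The method makes full use of structure and relies on tree automata to combine traditional information extraction methods, which take a purely structural approach, with the principles of grammatical inference, yielding the information extraction rules. Experimental results show that, compared with similar extraction methods, the method improves precision, recall, and the time required for extraction.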

15.
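Research on a Literature Information Extraction Method Based on Trees and Templates*   Total citations: 1, self-citations: 0, citations by others: 1
Automatic collection of information on teachers' research publications is an important means of managing research output effectively, and applying Web page information extraction methods to the automatic collection of literature information from online databases has broad application prospects. A literature information extraction method based on DOM trees and templates is proposed: the nesting relationships between HTML tags are used to represent a Web page as a DOM tree, the DOM tree structure is used to measure page similarity and to classify pages automatically, and pages with high similarity are processed with the same extraction template. Experimental results show that the method extracts literature information from online databases with an accuracy above 94%.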

16.
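A Web Information Extraction Method Based on HTML Pattern Algebra   Total citations: 3, self-citations: 0, citations by others: 3
Efficiently generating wrappers that extract Web information has broad application prospects, yet it remains a hard problem that has not been solved effectively. To this end, a pattern algebra based on HTML documents is proposed, which includes important concepts such as consistent pattern sets as well as an addition operation on patterns. On this basis, a new method for extracting Web information is proposed: it learns, over the whole set of training examples, the consistent pattern sets that represent the extraction rule for each attribute, and then extracts data using consistent pattern sets composed of multiple patterns. The method is suited to table-structured and hierarchically structured pages with missing attributes, multi-valued attributes, and attributes that appear in several different orders. Its effectiveness has been verified experimentally in a prototype system.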

17.
A theory and an algorithm for optimizing a directed, labeled tree are presented. Their application to optimizing any finite pattern grammar represented in the form of a tree is discussed. Tree optimization leads to a loss of information that is essential for the identification of patterns. A special technique for preserving this information is suggested. Finally, outlines of two different algorithms for parsing patterns are included. The tree parser uses the optimized tree, and the table-driven parser uses the optimized syntax stored in four separate tables.

18.
An ever greater range of applications call for learning from sequences. Grammar induction is one prominent tool for sequence learning; it is therefore important to know its properties and limits. This paper presents a new type of analysis for inductive learning. A few years ago, the discovery of a phase transition phenomenon in inductive logic programming proved that fundamental characteristics of the learning problems may affect the very possibility of learning under very general conditions. We show that, in the case of grammatical inference, while there is no phase transition when considering the whole hypothesis space, there is a much more severe “gap” phenomenon affecting the effective search space of standard grammatical induction algorithms for deterministic finite automata (DFA). Focusing on standard search heuristics, we show that they overcome this difficulty to some extent, but that they are subject to overgeneralization. Finally, the paper suggests some directions to alleviate this problem.
Michèle Sebag

19.
Recently Clark and Eyraud (2007) [10] have shown that substitutable context-free languages, which capture an aspect of natural language phenomena, are efficiently identifiable in the limit from positive data. Generalizing their work, this paper presents a polynomial-time learning algorithm for new subclasses of multiple context-free languages with variants of substitutability.

20.
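Research Progress in Web Data Extraction Technology   Total citations: 8, self-citations: 0, citations by others: 8
Because the Web contains a large amount of useful but complex information, academia and industry have in recent years developed many methods and tools for extracting data from the Web. This paper surveys research progress in Web data extraction technology, along with the main principles, processes, methods, and extraction rules for extracting data from the Web, and discusses future research directions.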
