Similar Literature
20 similar documents found.
1.
Existing multimedia document models like HTML, MHEG, SMIL, and HyTime lack appropriate modeling primitives to meet the needs of next-generation multimedia applications, which bring requirements such as reusability of multimedia content across different presentations and contexts, and adaptation to user preferences. In this paper, we motivate and present new requirements stemming from advanced multimedia applications and the resulting consequences for multimedia document models. Along these requirements, we discuss the document model standards HTML, HyTime, MHEG, and SMIL, as well as ZYX, a new model developed with special focus on reusability and adaptation. The analysis and comparison of the models show the limitations of existing models, point to the need for new, flexible multimedia document models, and shed light on the many implications for authoring systems, multimedia content management, and presentation.

2.
Chemical databases on the Internet are a valuable chemical information resource; making effective use of their data is the problem the chemistry deep web sets out to solve. This paper summarizes the characteristics of the chemistry deep web and, building on XML technology, extracts data from the semi-structured HTML pages returned by database queries, turning it into data that programs can consume directly for further computation. In the extraction process, JTidy first normalizes the HTML into a well-formed, content-faithful XHTML document; an XSLT transformation template embedding XPath expressions then performs the data conversion and extraction. Since the quality of the XPath expressions determines whether the XSLT template can keep extracting chemical data over the long run, the paper focuses on how to write robust XPath expressions: they should locate data in the source tree by content and attribute features, keep coupling between expressions as low as possible, and anticipate likely changes to chemistry sites so that corresponding measures in the XSLT template preserve their long-term validity. This provides methodological guidance for building XSLT data-extraction templates for the chemistry deep web.
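As a rough illustration of the content- and attribute-anchored XPath style the abstract recommends, here is a minimal Python sketch using lxml in place of the paper's JTidy + XSLT pipeline; the markup and field names are invented:

```python
# A minimal sketch of content/attribute-anchored XPath extraction, assuming
# lxml is installed. The HTML snippet and field names are invented examples;
# the paper's actual pipeline uses JTidy + XSLT templates instead.
from lxml import html

page = html.fromstring("""
<html><body>
  <table class="result">
    <tr><th>Boiling point</th><td>100 C</td></tr>
    <tr><th>Molar mass</th><td>18.02 g/mol</td></tr>
  </table>
</body></html>""")

# Anchor on content ("Boiling point") and attributes (class="result") rather
# than on positional paths like /html/body/table[2]/tr[1], so the expression
# survives layout changes around the data.
value = page.xpath(
    '//table[@class="result"]//tr[th[contains(text(), "Boiling point")]]/td/text()'
)
print(value)  # ['100 C']
```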

3.
The tremendous increase in user-generated content (UGC) published over the web in the form of natural language has posed a formidable challenge to automated information extraction (IE) and content analysis (CA). Techniques based on tree kernels (TK) have been successfully used for modelling semantic compositionality in many natural language processing (NLP) applications. Essentially, these techniques obtain the similarity of two production rules by exact string comparison between the peer nodes. As a result, semantically identical tree fragments are disregarded even though they could contribute to the similarity of two trees. A mechanism is needed that accounts for the similarity of rules with varied syntax and vocabulary that nonetheless carry analogous knowledge. In this paper, a hierarchical framework based on the document object model (DOM) tree and linguistic kernels is proposed that jointly addresses subjectivity detection, opinion extraction, and polarity classification. The model proceeds in three stages: in the first stage, the content of each DOM tree node is analysed to estimate the complexity of its vocabulary and syntax using a readability test. In the second stage, semantic tree kernels extended with word embeddings classify nodes as containing subjective or objective content. Finally, content found to be subjective is further examined for opinion polarity classification using fine-grained linguistic kernels. The efficiency of the proposed model is demonstrated through a series of experiments. The results reveal that the proposed polarity-enriched tree kernel (PETK) yields better prediction performance than conventional tree kernels.
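The core relaxation the paper argues for, soft-matching peer nodes instead of requiring exact string equality, can be caricatured in a few lines; the vectors below are made-up stand-ins for trained word embeddings:

```python
# Toy sketch of the idea behind embedding-extended tree kernels: instead of
# requiring exact string equality between peer nodes, score node pairs by
# cosine similarity of their word vectors. Vectors here are made up; the
# paper compares full production rules with trained embeddings.
import numpy as np

embeddings = {
    "movie": np.array([0.9, 0.1, 0.0]),
    "film":  np.array([0.85, 0.15, 0.05]),
    "table": np.array([0.0, 0.2, 0.9]),
}

def node_sim(a: str, b: str) -> float:
    """Soft match: 1.0 for identical labels, else embedding cosine."""
    if a == b:
        return 1.0
    va, vb = embeddings.get(a), embeddings.get(b)
    if va is None or vb is None:
        return 0.0
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(node_sim("movie", "movie"))  # 1.0 (exact match)
print(node_sim("movie", "film"))   # ~0.98: counted, unlike exact matching
```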

4.
Information extraction is a data-mining technique widely used on the Internet. Its goal is to extract meaningful, valuable data and information from the vast amount of web data so that Internet resources can be better exploited. This paper uses a method based on statistical features of web pages to extract the main text from Chinese web pages. The method first represents the page as an XML-based DOM tree, uses statistical node information to filter noise nodes out of the tree, and finally selects the main-text nodes. Compared with traditional wrapper-based extraction, the method is simple and practical; experiments show extraction accuracy above 90%, giving it good practical value.
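A minimal sketch of the statistics-driven node filtering described above, assuming lxml; the character-count and link-density thresholds are illustrative, not the paper's:

```python
# Minimal sketch of statistics-based main-text selection, assuming lxml.
# The thresholds and the link-density heuristic are illustrative stand-ins
# for the paper's node statistics.
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>
  <div id="story">This is a long article paragraph with the actual
  content of the page, which dominates the node's text statistics.</div>
</body></html>""")

def main_text_nodes(root, min_chars=40, max_link_density=0.5):
    picked = []
    for node in root.iter("div", "p", "td"):
        text = (node.text_content() or "").strip()
        link_text = sum(len(a.text_content() or "") for a in node.iter("a"))
        density = link_text / len(text) if text else 1.0
        # Keep nodes with enough text and few anchor characters (noise
        # blocks like navigation bars are mostly links).
        if len(text) >= min_chars and density <= max_link_density:
            picked.append(text)
    return picked

print(main_text_nodes(doc))  # only the #story text survives
```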

5.
An information-extraction method is proposed that prunes nodes with large information-entropy gain. A DOM tree is built by parsing the HTML document; content that need not be processed is filtered out according to a configuration and a semantic model tree is built; finally, nodes whose entropy gain exceeds a threshold are pruned and the extracted topic-information page is output. Preliminary experiments confirm the effectiveness of this approach to web-page information extraction. The method's mathematical model is simple and reliable, and topic extraction requires essentially no manual intervention. It can be applied in web data-mining systems and in information access on mobile devices such as PDAs.
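The prune-above-a-threshold mechanism can be sketched as follows; the child-tag entropy used here is only a stand-in for the paper's entropy-gain measure over its semantic model tree, and the threshold and markup are invented:

```python
# Mechanism-only sketch of entropy-threshold pruning, assuming lxml. As a
# stand-in for the paper's entropy-gain score, each container is scored by
# the entropy of its child-tag distribution: noisy widget areas mix many
# element types, while article bodies are mostly uniform <p> runs.
import math
from collections import Counter
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="widgets"><a>x</a><img/><span>y</span><input/><button>z</button></div>
  <div id="article"><p>Para 1.</p><p>Para 2.</p><p>Para 3.</p></div>
</body></html>""")

def tag_entropy(node) -> float:
    counts = Counter(child.tag for child in node)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

THRESHOLD = 1.0  # illustrative cutoff
for div in doc.iter("div"):
    action = "prune" if tag_entropy(div) > THRESHOLD else "keep"
    print(div.get("id"), round(tag_entropy(div), 2), action)
# widgets: entropy log2(5) ~ 2.32 -> prune; article: entropy 0.0 -> keep
```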

6.
Most existing web-page title extraction methods rely on HTML structure and tag features, but they ignore the relationship between the title and the body content. This paper proposes a similarity-based title-extraction method that exploits the relation between a page's title and its body text: it computes similarities and corresponding weights between linguistic "units", adjusts the weights with the HITS algorithm, and extracts the true title according to a specific selection rule. Experiments show that the method not only performs satisfactorily on "non-standard" pages but also generalizes well to "standard" pages.
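A toy rendering of the idea, with whitespace tokens, Jaccard similarity, and a HITS-style mutual reinforcement between title candidates and body sentences; all are simplifications of the paper's language-unit similarities:

```python
# Toy sketch of similarity-based title selection with a HITS-style mutual
# reinforcement between candidate titles and body sentences. Tokenization,
# the Jaccard similarity, and the data are invented simplifications.
import numpy as np

candidates = ["Site Navigation", "City Opens New Transit Line", "Contact Us"]
body = ["The city opened a new transit line on Monday.",
        "Officials said the line will cut commute times."]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

S = np.array([[jaccard(c, s) for s in body] for c in candidates])

# HITS-style iteration: title scores (hubs) and sentence scores
# (authorities) reinforce each other through the similarity matrix.
hub = np.ones(len(candidates))
for _ in range(20):
    auth = S.T @ hub
    hub = S @ auth
    hub /= np.linalg.norm(hub) or 1.0

print(candidates[int(np.argmax(hub))])  # -> "City Opens New Transit Line"
```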

7.
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents over the Internet and intranets. HTML allows the author to organize the text into subparts delimited by special tags; these subparts are then visualized by the HTML browser in distinct ways, i.e. with distinct typographical formats. In this paper a model for indexing HTML documents is proposed which exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method to compute the significance degree of a term in a document by weighting the term instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the set of numerical weights assigned to the various tags and the respective text depends (beyond the tags themselves) on the particular document considered. By means of the contextual weighted representation our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex non-linear weighted representation and has a number of novel and interesting features.
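The central computation, weighting a term's occurrences by the tag they appear in, reduces to a small accumulation; the tag weights below are invented rather than the contextual, per-document weights the paper derives:

```python
# Minimal sketch of tag-weighted term significance: a term's weight is the
# sum over tags of (occurrences in that tag) x (the tag's weight in this
# document). The tag weights below are illustrative, not the contextual
# weights derived in the paper.
from collections import defaultdict

# (tag, text) pairs as they might come from an HTML parse
fields = [("title", "transit line opens"),
          ("h1", "new transit line"),
          ("p", "the line opened monday and the line is busy")]

tag_weight = {"title": 3.0, "h1": 2.0, "p": 1.0}  # invented weights

significance = defaultdict(float)
for tag, text in fields:
    for term in text.split():
        significance[term] += tag_weight[tag]

# "line" occurs in title, h1 and twice in p: 3 + 2 + 1 + 1 = 7
print(sorted(significance.items(), key=lambda kv: -kv[1])[:3])
```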

8.
An improved tree-matching algorithm is proposed. Taking HTML's characteristics into account, the tree edit distance method is refined: HTML tree nodes are assigned different weights according to the importance of the data they render in a browser. The algorithm builds node-weighted HTML trees from HTML data objects, and pattern recognition is achieved by finding the maximum mapping between the two constructed trees. Experiments on commercial websites confirm the effectiveness of the algorithm.
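A sketch of node-weighted tree matching in the spirit of Simple Tree Matching; the weight table, standing in for the paper's browser-display weights, and the trees are invented:

```python
# Sketch of weighted tree matching in the spirit of Simple Tree Matching,
# with per-node weights standing in for the paper's browser-visibility
# weights. The weight table and trees are invented.

# A tree is (tag, [children...])
t1 = ("table", [("tr", [("td", []), ("td", [])]),
                ("tr", [("td", [])])])
t2 = ("table", [("tr", [("td", []), ("td", [])])])

WEIGHT = {"table": 1.0, "tr": 2.0, "td": 3.0}  # data-bearing nodes count more

def match(a, b):
    if a[0] != b[0]:
        return 0.0
    m, n = len(a[1]), len(b[1])
    # Dynamic program over the two child sequences, as in Simple Tree
    # Matching: find the best order-preserving child mapping.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + match(a[1][i - 1], b[1][j - 1]))
    return dp[m][n] + WEIGHT.get(a[0], 1.0)

print(match(t1, t2))  # 9.0: shared table/tr/td structure, weighted
```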

9.
赵赛, 陈松乔, 邓莎莎. 《微机发展》, 2006, 16(6): 242-244
Under three research assumptions about Web data integration, this paper discusses a rule-tree-based wrapper generation model. It comprises preprocessing, HTML tree generation, schema tree generation, mapping-rule acquisition, rule-tree generation, rule-tree repair, and wrapper execution. The implementation of the mapping rules and the rule-tree generation algorithm are described in detail. Experimental tests show that the method is well suited to Web data extraction.

10.
In recent years, malicious web page detection has mostly relied on semantic analysis or emulated code execution to extract features, but such methods are complex to implement, computationally expensive, and enlarge the attack surface. This paper proposes a deep-learning-based detection method: simple regular expressions first extract semantics-agnostic tokens directly from the static HTML document, and a neural network model then captures locality-aware representations of the document at multiple hierarchical spatial scales, enabling tiny snippets of malicious code to be found quickly in pages of arbitrary length. Compared against several baseline and ablated models, the method achieves a 96.4% detection rate at a 0.1% false-positive rate, a better classification accuracy. Its speed and accuracy make it suitable for deployment on endpoints, firewalls, and web proxies.
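Only the feature front end is easy to sketch: regex token extraction from raw HTML plus feature hashing into a fixed-size bag that a hierarchical neural model could consume; the regex, bucket count, and sample are invented, and no real payload is shown:

```python
# Sketch of the feature front-end only: pull semantics-agnostic tokens from
# raw HTML with a simple regex and hash them into a fixed-size bag, which a
# hierarchical neural model would then consume (the paper pools such bags
# at multiple spatial scales). Regex, bucket count, and sample are invented.
import re
import numpy as np

TOKEN_RE = re.compile(r"[\w.:/-]+")   # crude token pattern over raw text
N_BUCKETS = 1024

def hashed_bag(html_text: str) -> np.ndarray:
    vec = np.zeros(N_BUCKETS, dtype=np.float32)
    for tok in TOKEN_RE.findall(html_text):
        vec[hash(tok) % N_BUCKETS] += 1.0   # salted hash; fine for a sketch
    return vec

sample = '<html><script src="http://example.com/x.js"></script></html>'
features = hashed_bag(sample)
print(features.sum(), features.nonzero()[0][:5])
```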

11.
Automatic Extraction of Topic Information from Web Pages Based on DOM (cited by 43: 0 self-citations, 43 by others)
The main information a web page conveys is usually buried in large amounts of irrelevant structure and text, preventing users from obtaining the topic information quickly and limiting the usability of the Web; information extraction helps solve this problem. Based on the DOM specification, and addressing HTML's semi-structured nature and lack of semantic description, an STU-DOM tree model carrying semantic information is proposed. An HTML document is converted into an STU-DOM tree, which is then subjected to structure-based filtering and semantics-based pruning, so that the topic information can be extracted accurately. The method does not depend on the information source and does not alter the structure or content of the source page; it is automatic, reliable, and general, with considerable application value for web browsing on PDAs and mobile phones and for information-retrieval systems.

12.
Research on Statistics-Based Extraction of Main Text from Web Pages (cited by 47: 6 self-citations, 47 by others)
To apply natural-language-processing techniques effectively to web documents, this paper proposes a method that relies on statistical information to extract the main text from Chinese news pages. The method first represents the page as a tree according to its HTML tags, then uses the number of Chinese characters contained in each node to select the nodes holding the main text. It overcomes the drawback of traditional content-extraction methods, which must build a separate wrapper for each data source, and is simple and accurate; experiments show extraction accuracy above 95%. A text-extraction tool implemented with this method currently supplies corpus data to a tourism-domain question-answering system and meets its needs well.
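The Chinese-character statistic at the heart of the method fits in a few lines, assuming lxml; the miniature markup is invented:

```python
# Sketch of the Chinese-character statistic, assuming lxml: represent the
# page as a tree and pick the node containing the most CJK characters.
# Markup is an invented miniature of a news page.
import re
from lxml import html

CJK = re.compile(r"[\u4e00-\u9fff]")

doc = html.fromstring("""
<html><body>
  <div id="nav">首页 新闻</div>
  <div id="body">今日本市地铁新线开通，首日客流平稳，市民出行更加便利。</div>
</body></html>""")

def cjk_count(node) -> int:
    return len(CJK.findall(node.text_content() or ""))

best = max(doc.iter("div"), key=cjk_count)
print(best.get("id"), cjk_count(best))  # the #body node wins on count
```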

13.
Extracting the main text of web pages is the basis of web information-processing tasks such as information retrieval and text mining. Based on a statistical analysis of the textual and structural features of topic pages, a main-text extraction method is proposed that combines textual features with the characteristics of HTML tags. After parsing the page into a DOM tree, the main-text block is located from the tree structure, the noise inside the block is characterized, and that noise is removed. Experiments show the method achieves good precision and recall.

14.
To obtain useful content from large amounts of irrelevant information, main-text extraction has become an indispensable part of Web data applications. This paper proposes a text-density-model-based method for extracting the main text of news pages. A statistical model that fuses page structure with linguistic features converts the document, line by line, into a sequence of positive and negative densities; the density sequence is then corrected with Gaussian smoothing, exploiting the content continuity of neighboring lines; finally, an improved maximum-subsequence segmentation extracts the main text. The method preserves the integrity of the main text while rejecting noise, and needs no manual intervention or repeated training. Experiments show that text-density-based extraction adapts well to different data sources, with precision and recall superior to existing statistical models.
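The final segmentation step is essentially a maximum-sum contiguous subsequence over signed line scores; the scoring below is a crude stand-in for the paper's fused structural and linguistic model:

```python
# Sketch of the maximum-subsequence step: turn each line into a signed score
# (text-dense lines positive, tag/link-heavy lines negative), then take the
# contiguous run with the largest sum as the body. The scoring is a crude
# stand-in for the paper's density model; smoothing is omitted.
import re

lines = [
    '<div class="nav"><a href="/">home</a><a href="/n">news</a></div>',
    "<p>The council approved the new transit plan on Monday evening,",
    "after a long public hearing that drew hundreds of residents.</p>",
    '<div class="footer"><a href="/about">about</a></div>',
]

def score(line: str) -> float:
    text = re.sub(r"<[^>]+>", "", line)          # strip tags
    tags = len(re.findall(r"<[^>]+>", line))
    return len(text.strip()) - 20.0 * tags       # invented weighting

scores = [score(l) for l in lines]
# Kadane's algorithm: maximum-sum contiguous subsequence of line scores.
best, cur, start, best_span = float("-inf"), 0.0, 0, (0, 0)
for i, s in enumerate(scores):
    if cur <= 0:
        cur, start = s, i
    else:
        cur += s
    if cur > best:
        best, best_span = cur, (start, i + 1)

print(lines[best_span[0]:best_span[1]])  # the two content lines
```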

15.
16.
To improve XML document query efficiency, a combined index technique based on inverted lists and B+ trees is proposed. The DTD structure index and the content index use inverted lists as the index unit, while the XML document index uses a B+ tree as its basic organization. Identification information placed in the node encoding of the DTD structure index makes it easy to determine which documents need to be queried. By building the DTD structure index, the XML document index, and the content index, queries over mixed XML documents are supported. Theoretical analysis and experiments show that the technique has low space overhead and high query efficiency.
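The two index shapes can be caricatured in plain Python: an inverted list for the content index, and a sorted, bisect-searched key table standing in for the B+ tree; the data is invented:

```python
# Toy sketch of the two index shapes: an inverted list from terms to element
# ids for content lookup, and a sorted, bisect-searched table standing in
# for the B+ tree that organizes the document index. Data is invented.
import bisect
from collections import defaultdict

# Content index: term -> postings of (document id, element id)
inverted = defaultdict(list)
elements = {(1, 10): "benzene ring structure", (1, 11): "water molecule",
            (2, 20): "benzene derivatives"}
for (doc, el), text in elements.items():
    for term in text.split():
        inverted[term].append((doc, el))

# Document index: sorted keys + bisect as a stand-in for B+-tree search.
keys = sorted(elements)           # (doc id, element id) pairs
def lookup(doc, el):
    i = bisect.bisect_left(keys, (doc, el))
    return keys[i] if i < len(keys) and keys[i] == (doc, el) else None

print(inverted["benzene"])        # postings: [(1, 10), (2, 20)]
print(lookup(2, 20), elements[lookup(2, 20)])
```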

17.
In this paper, a new hybrid classifier is proposed by combining a neural network with direct fractional-linear discriminant analysis (DF-LDA). The proposed hybrid classifier, a neural tree with linear discriminant analysis called NTLD, adopts a tree structure containing either a simple perceptron or a linear discriminant at each node. Weakly performing perceptron nodes are automatically replaced with DF-LDA. Thanks to this node substitution, the tree-building process converges faster and avoids over-fitting on complex training sets, resulting in a shallower tree with better classification performance. The proposed NTLD algorithm is tested on various synthetic and real datasets. The experimental results show that NTLD yields very satisfactory results in terms of both tree-depth reduction and classification accuracy.
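The node-substitution rule is easy to sketch with scikit-learn, using its LinearDiscriminantAnalysis as an approximation of DF-LDA; the 0.8 cutoff and the synthetic data are invented:

```python
# Sketch of the node-substitution idea, assuming scikit-learn: train a
# perceptron at a tree node and, if it performs weakly on held-out data,
# swap in a linear discriminant. The 0.8 cutoff and synthetic data are
# invented; sklearn's LDA approximates the paper's DF-LDA variant.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

node = Perceptron(random_state=0).fit(X_tr, y_tr)
if node.score(X_va, y_va) < 0.8:          # weakly performing perceptron ->
    node = LinearDiscriminantAnalysis().fit(X_tr, y_tr)  # replace with LDA

print(type(node).__name__, round(node.score(X_va, y_va), 3))
```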

18.
Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the Web document HTML tags. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naïve Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy, and in addition achieve a significant reduction in classification time. © 2008 Wiley Periodicals, Inc.
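One way to picture a graph-plus-vector hybrid is a similarity that mixes word-order edge overlap with term-frequency cosine, which an instance-based classifier like k-NN could consume directly; the mixing weight is invented and this is not the article's exact construction:

```python
# Toy sketch of a hybrid representation: a word-order edge set (graph side)
# plus a term-frequency vector (vector side), combined into one similarity.
# The mixing weight alpha is invented.
import math
from collections import Counter

def graph_edges(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))           # directed word-order edges

def cosine(c1: Counter, c2: Counter) -> float:
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def hybrid_sim(a: str, b: str, alpha=0.5) -> float:
    ea, eb = graph_edges(a), graph_edges(b)
    jac = len(ea & eb) / len(ea | eb) if ea | eb else 0.0
    cos = cosine(Counter(a.lower().split()), Counter(b.lower().split()))
    return alpha * jac + (1 - alpha) * cos    # graph + vector evidence

print(hybrid_sim("new transit line opens", "transit line opens downtown"))
```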

19.
Keyphrase extraction, i.e., extracting from a document a set of keyphrases that express its topic and content, matters for text-processing tasks such as information retrieval and document classification. However, the existing literature lacks keyphrase-extraction algorithms tailored to the characteristics of Chinese. This paper proposes a semi-supervised Chinese keyphrase extraction model that uses a pre-trained language model to represent phrases and documents, reducing the algorithm's dependence on large amounts of labeled training data; a graph model then describes the similarity space among candidate phrases and iteratively computes each phrase's importance; several statistical features are further combined to improve the accuracy of phrase scoring. Comparative experiments show that the proposed method clearly outperforms the baselines on Chinese keyphrase extraction.
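The graph-ranking stage can be sketched as a TextRank-style power iteration over pairwise phrase similarities; the toy vectors stand in for the pre-trained language-model embeddings the paper uses:

```python
# Sketch of the graph step only: rank candidate phrases by iterating
# importance over their pairwise similarity graph (TextRank-style). The toy
# vectors stand in for pre-trained language-model embeddings.
import numpy as np

phrases = ["关键短语抽取", "图模型", "天气预报"]
vecs = np.array([[0.9, 0.1, 0.1],
                 [0.7, 0.3, 0.1],
                 [0.1, 0.1, 0.9]])    # invented embeddings

V = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
S = V @ V.T                           # cosine similarity graph
np.fill_diagonal(S, 0.0)
W = S / S.sum(axis=1, keepdims=True)  # row-normalize to a transition matrix

score, d = np.full(len(phrases), 1.0 / len(phrases)), 0.85
for _ in range(50):                   # power iteration with damping
    score = (1 - d) / len(phrases) + d * (W.T @ score)

for p, s in sorted(zip(phrases, score), key=lambda t: -t[1]):
    print(p, round(float(s), 3))
```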

20.
The TABLE tags in HTML (Hypertext Markup Language) documents are widely used for formatting the layout of Web documents as well as for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relations extraction. During preprocessing, a portion of the genuine and non-genuine tables is filtered out using a set of rules devised from careful examination of the general characteristics of various HTML tables. The remaining tables are detected in the attribute-value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1393 HTML documents show that the method outperforms previous work, with a precision of 97.54% and a recall of 99.22%.
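A toy version of the rule-based filtering phase, assuming lxml: treat tables with one row or a ragged cell grid as layout, and pass shape-consistent ones on to attribute-value checking; the rules and markup are invented simplifications of the paper's rule set:

```python
# Toy sketch of the rule-based filtering phase: mark a <table> as "layout"
# when it has a single row or a ragged cell grid, else pass it on for
# attribute-value coherency checking. Rules and markup are invented; the
# paper's full method also checks syntactic/semantic coherency.
from lxml import html

page = html.fromstring("""
<html><body>
<table><tr><td><div>whole page layout</div></td></tr></table>
<table>
  <tr><th>Name</th><th>Mass</th></tr>
  <tr><td>H2O</td><td>18.02</td></tr>
  <tr><td>CO2</td><td>44.01</td></tr>
</table>
</body></html>""")

def looks_genuine(table) -> bool:
    rows = table.findall(".//tr")
    widths = {len(r.findall("./th") + r.findall("./td")) for r in rows}
    # Layout tables tend to be 1x1 or ragged; genuine ones repeat a shape.
    return len(rows) >= 2 and widths == {max(widths)} and max(widths) >= 2

for t in page.iter("table"):
    print("genuine" if looks_genuine(t) else "layout")
```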
