Found 18 similar documents.
6.
Application of the MD5 Algorithm to Eliminating Duplicate Web Pages  (Total citations: 1; self-citations: 0; by others: 1)
When Internet users retrieve Web information through common search engines, they often receive large amounts of duplicated page content, which lowers search efficiency. Exploiting the maturity and good portability of the MD5 algorithm, this paper proposes an MD5-based algorithm for eliminating duplicate Web pages. Experiments show that the algorithm removes duplicate pages effectively with modest time and space complexity, giving it strong practical value.
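The core of the MD5-based scheme this abstract describes can be sketched roughly as follows (a minimal illustration; the whitespace normalization step and the function names are my assumptions, not the paper's exact procedure):

```python
import hashlib

def page_digest(text: str) -> str:
    """Return the MD5 digest of a page's extracted text."""
    # Normalizing whitespace first makes trivially reformatted copies collide.
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(pages: list[str]) -> list[str]:
    """Keep only the first page seen for each distinct MD5 digest."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = page_digest(page)
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique
```

Because only a digest per page is stored, both the time and the space cost grow linearly with the number of pages, consistent with the low complexity the abstract claims.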
8.
This paper studies Web page duplicate detection. The traditional SCAM algorithm judges whether pages are duplicates by comparing the occurrence counts of a few keywords; when a site contains similar pages whose keywords are nearly identical, this causes misjudgments and low detection accuracy. This paper proposes a page-fingerprint duplicate-detection algorithm: using information-retrieval techniques, a fingerprint is extracted from the page under test and compared against the fingerprints in a page library to decide duplication, avoiding the low accuracy caused by relying on only a few keywords. Experiments show that this fingerprint-based method judges duplication accurately, improves the reliability of Web page information, and yields satisfactory results.
9.
The Internet contains large numbers of duplicate Web pages, so when performing information retrieval or large-scale page crawling, deduplication is one of the keys to efficiency. Building on "fingerprint" and feature-code deduplication algorithms, this paper proposes an edit-distance-based page deduplication algorithm that obtains the similarity between pages by computing the edit distance between their fingerprint sequences. It overcomes the drawback that fingerprint and feature-code algorithms ignore the structure of the page body: pages are compared on both content and body structure, making duplicate judgments more accurate. Experiments show the algorithm is effective, with high deduplication precision and recall.
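The similarity measure this abstract describes can be sketched as follows (a minimal illustration: the standard Levenshtein distance applied to fingerprint sequences, with the normalization by the longer sequence being my assumption rather than the paper's exact formula):

```python
def edit_distance(a, b) -> int:
    """Classic one-row dynamic-programming (Levenshtein) edit distance."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def fingerprint_similarity(fp_a, fp_b) -> float:
    """Similarity of two fingerprint sequences, derived from edit distance."""
    if not fp_a and not fp_b:
        return 1.0
    return 1.0 - edit_distance(fp_a, fp_b) / max(len(fp_a), len(fp_b))
```

Because the fingerprints are compared as ordered sequences rather than as bags of features, the measure is sensitive to the order of the page body's blocks, which is how the approach accounts for body structure.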
10.
A Web Page Information-Hiding Algorithm Based on Repeated Introduction of CSS Class Selectors  (Total citations: 1; self-citations: 0; by others: 1)
Existing Web page information-hiding algorithms suffer from two drawbacks: the hiding points are separated from the page content, and resistance to automated filtering is poor. Based on a strategy of repeatedly introducing CSS class selectors, a new page information-hiding algorithm is proposed: following the embedding rule, information is embedded by repeatedly introducing the class selectors of relevant objects in operable CSS blocks. Experimental results show that the hiding points are tightly bound to the page content, improving resistance to detection and to machine filtering; the scheme has good imperceptibility and a relatively large hiding capacity, and can be applied to Web page protection and covert communication.
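The carrier this abstract relies on can be illustrated with a toy encoding (my own assumption for the embedding rule, not the paper's: one bit per element, where repeating a class name in the `class` attribute encodes 1 and a single occurrence encodes 0; browsers render both identically, so the page's appearance is unchanged):

```python
def embed_bits(class_names, bits):
    """Encode one bit per element: repeat the class selector for 1, single for 0."""
    attrs = []
    for name, bit in zip(class_names, bits):
        attrs.append(f'class="{name} {name}"' if bit else f'class="{name}"')
    return attrs

def extract_bits(attrs):
    """Recover bits by checking whether the class name is repeated."""
    bits = []
    for attr in attrs:
        names = attr[len('class="'):-1].split()
        bits.append(1 if len(names) == 2 and names[0] == names[1] else 0)
    return bits
```

Because the hidden bits live in class attributes that the page content actually uses, stripping them would alter styling, which is the sense in which the hiding points are "bound to the page content".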
11.
Dynamic HTML (DHTML) does more than enhance animation effects on Web pages; more importantly, it explicitly brings object-oriented programming ideas into page design, making pages more flexible, dynamic, and interactive. With examples, this article details how to apply object-oriented thinking to design fully functional and distinctive Web pages.
13.
Lakshmish Ramaswamy, Arun Iyengar, L. Liu, F. Douglis. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(6): 859-874
Constructing Web pages from fragments has been shown to provide significant benefits for both content generation and caching. In order for a Web site to use fragment-based content generation, however, good methods are needed for fragmenting the Web pages. Manual fragmentation of Web pages is expensive, error prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in Web sites serving dynamic content. Our approach analyzes Web pages with respect to their information sharing behavior, personalization characteristics, and change patterns. We identify fragments which are shared among multiple documents or have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a framework for fragment detection, which includes a hierarchical and fragment-aware model for dynamic Web pages and a compact and effective data structure for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. This paper shows the results when the algorithms are applied to real Web sites. We evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of using the fragments detected by our system on key parameters such as disk space utilization, network bandwidth consumption, and load on the origin servers.
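The shared-fragment part of this scheme can be reduced to a simple counting idea (a sketch only: the paper's hierarchical, fragment-aware page model and its compact detection data structure are replaced here by a plain hash-to-documents index, and the fragment lists are assumed to be given):

```python
import hashlib
from collections import defaultdict

def shared_fragments(pages: dict, min_share: int = 2) -> dict:
    """Map each candidate fragment digest to the set of pages containing it,
    keeping only fragments shared by at least `min_share` documents."""
    occurrences = defaultdict(set)
    for page_id, fragments in pages.items():
        for frag in fragments:
            digest = hashlib.md5(frag.encode("utf-8")).hexdigest()
            occurrences[digest].add(page_id)
    return {d: ids for d, ids in occurrences.items() if len(ids) >= min_share}
```

Fragments that pass the `min_share` threshold are exactly the ones worth caching once and reusing across pages, which is where the bandwidth and origin-server savings come from.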
14.
To automatically extract news titles from large numbers of complex, non-standard Web page structures, this paper proposes a news-title extraction algorithm based on density and text features (title extraction with density and text-features, TEDT). A corpus-decision model that fuses the page's text-density distribution with linguistic features divides the page into a corpus region and a title-candidate region; after selecting the corpus, the TextRank algorithm computes the corresponding key-value weight set, and an improved similarity measure then extracts the news title from the candidate region. The algorithm effectively separates the corpus and title regions, reduces interference from page noise, and extracts news titles accurately. Experimental results show that TEDT outperforms traditional rule-based and similarity-based title-extraction algorithms in both precision and recall, demonstrating that TEDT is effective not only on mainstream news sites but also on complex, non-standard pages.
16.
W. Hu, O. Wu, Z. Chen, Z. Fu, S. Maybank. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1019-1034
With the rapid development of the World Wide Web, people benefit more and more from the sharing of information. However, Web pages with obscene, harmful, or illegal content can be easily accessed. It is important to recognize such unsuitable, offensive, or pornographic Web pages. In this paper, a novel framework for recognizing pornographic Web pages is described. A C4.5 decision tree is used to divide Web pages, according to content representations, into continuous text pages, discrete text pages, and image pages. These three categories of Web pages are handled, respectively, by a continuous text classifier, a discrete text classifier, and an algorithm that fuses the results from the image classifier and the discrete text classifier. In the continuous text classifier, statistical and semantic features are used to recognize pornographic texts. In the discrete text classifier, the naive Bayes rule is used to calculate the probability that a discrete text is pornographic. In the image classifier, the object's contour-based features are extracted to recognize pornographic images. In the text and image fusion algorithm, the Bayes theory is used to combine the recognition results from images and texts. Experimental results demonstrate that the continuous text classifier outperforms the traditional keyword-statistics-based classifier, the contour-based image classifier outperforms the traditional skin-region-based image classifier, the results obtained by our fusion algorithm outperform those by either of the individual classifiers, and our framework can be adapted to different categories of Web pages.
17.
Addressing the deduplication of duplicate Web pages, two algorithms for extracting repeated words and sentences are systematically analyzed and compared. The STC algorithm performs well in time cost, while the inverted-index method over repeated sequences is superior in space complexity. Combining the STC algorithm, the repeated-sequence method is improved: for duplicate pages produced by topic-oriented reposting, repeated strings are first extracted and then used as an index for STC-based repeated extraction. Experimental results show that the improved algorithm greatly increases time efficiency while retaining the original space characteristics.
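The inverted-index side of this comparison can be sketched as follows (a toy version under my own assumptions: sentences are the repeat unit, split naively on periods, and the index maps each sentence to the documents containing it; the paper's actual repeated-sequence extraction and its STC combination are more involved):

```python
from collections import defaultdict

def build_sentence_index(docs: dict) -> dict:
    """Inverted index: sentence -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sent in filter(None, (s.strip() for s in text.split("."))):
            index[sent].add(doc_id)
    return index

def repeated_sequences(index: dict, min_docs: int = 2) -> dict:
    """Sentences occurring in at least `min_docs` documents:
    candidate repeated strings for duplicate-page detection."""
    return {s: ids for s, ids in index.items() if len(ids) >= min_docs}
```

The index stores each distinct sentence once regardless of how many pages repeat it, which is the source of the space advantage the abstract attributes to this method.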
18.
Web information extraction has given rise to large-scale applications. Wrapper-based Web information extraction has two research areas: wrapper generation and wrapper maintenance. A new algorithm for automatic wrapper maintenance is proposed, based on the following observation: although pages change in many different ways, many important page features are preserved in the new pages, such as text patterns, annotations, and hyperlinks. The new algorithm makes full use of these preserved features to locate the target information in changed pages and to automatically repair broken wrappers. Experiments on information extraction from real Web sites show that the new algorithm effectively maintains wrappers so that information can be extracted more accurately.
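The relocation idea behind such repair can be sketched minimally (my own simplification: a single preserved text landmark plus an expected text pattern stand in for the paper's richer set of preserved features such as annotations and hyperlinks):

```python
import re

def relocate(page: str, landmark: str, pattern: str):
    """Find preserved landmark text in the changed page, then extract the
    first value after it that matches the expected text pattern."""
    pos = page.find(landmark)
    if pos == -1:
        return None  # landmark not preserved; this wrapper rule cannot be repaired
    match = re.search(pattern, page[pos + len(landmark):])
    return match.group(0) if match else None
```

Once the target value is found again via the preserved landmark, its new surrounding markup can be recorded as the repaired extraction rule, which is how a broken wrapper is brought back into service without human intervention.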