
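Web page deduplication based on Web page text structure
Cite this article: WEI Li-xia, ZHENG Jia-heng. Web page deduplication based on Web page text structure[J]. Journal of Computer Applications (计算机应用), 2007, 27(11): 2854-2856.
Authors: WEI Li-xia  ZHENG Jia-heng
Affiliation: School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
Funding: National Natural Science Foundation of China; Natural Science Foundation of Shanxi Province
Abstract: Duplicate Web pages returned by search engines not only waste storage resources but also add to the user's browsing burden. Based on the characteristics of duplicate Web pages and of Web page text itself, a dynamic method for Web page deduplication is proposed. The method represents the body text of a Web page as a catalogue (directory) structure tree and, on that basis, implements a dynamic feature-extraction algorithm and a similarity computation algorithm based on hierarchical fingerprints. Experiments show that the method accurately detects both fully duplicated and partially duplicated Web pages.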

Keywords: hierarchical fingerprint  text structure  Web page deduplication
Article ID: 1001-9081(2007)11-2854-03
Received: 2007-05-28
Revised: 2007-05-28

Detection and elimination of similar Web pages based on text structure
WEI Li-xia, ZHENG Jia-heng. Detection and elimination of similar Web pages based on text structure[J]. Journal of Computer Applications, 2007, 27(11): 2854-2856.
Authors:WEI Li-xia  ZHENG Jia-heng
Abstract: Similar Web pages returned by search engines not only waste storage resources but also increase the browsing burden on users. A dynamic method for detecting similar Web pages is proposed, based on the features of similar Web pages and of Web page text itself. The method represents the body text of a Web page as a catalogue structure tree and implements a dynamic algorithm for extracting text features and a layer fingerprint algorithm for computing the degree of similarity. Experimental results show that the method accurately detects both completely similar and partly similar Web pages.
Keywords:layer fingerprint  text structure  detection and elimination of similar Web pages
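
The abstract does not give the details of the layer fingerprint computation, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a page's body text has already been parsed into a tree of headings and paragraphs; the `Node` class, the use of MD5, and the Jaccard-style score over leaf fingerprints are all illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical text-structure tree (title, section, or paragraph)."""
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def _digest(data: str) -> str:
    # MD5 is an arbitrary choice here; any stable hash would serve.
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def fingerprint(node: Node) -> str:
    """Leaf: hash of whitespace-normalized text.
    Inner node: hash of its own text plus its children's fingerprints, in order."""
    own = " ".join(node.text.split())
    if not node.children:
        return _digest(own)
    return _digest(own + "|" + "|".join(fingerprint(c) for c in node.children))


def leaf_fingerprints(node: Node) -> set[str]:
    """Collect the fingerprints of all leaves (paragraph-level units)."""
    if not node.children:
        return {fingerprint(node)}
    result: set[str] = set()
    for child in node.children:
        result |= leaf_fingerprints(child)
    return result


def similarity(a: Node, b: Node) -> float:
    """1.0 when the root fingerprints match (full duplicate); otherwise the
    share of leaf fingerprints the two trees have in common (partial duplicate)."""
    if fingerprint(a) == fingerprint(b):
        return 1.0
    leaves_a, leaves_b = leaf_fingerprints(a), leaf_fingerprints(b)
    union = leaves_a | leaves_b
    return len(leaves_a & leaves_b) / len(union) if union else 0.0


page1 = Node("Title", [Node("para one"), Node("para two")])
page2 = Node("Title", [Node("para  one"), Node("para two")])   # differs only in whitespace
page3 = Node("Other", [Node("para one"), Node("para three")])  # shares one paragraph
```

In this sketch, comparing the root fingerprints first is what makes the check hierarchical: identical pages are recognized with a single comparison, and only pages that differ fall through to the paragraph-level comparison that exposes partial overlap.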
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).