
Web Page Deduplication Based on Web Page Text Structure (基于网页文本结构的网页去重)

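Cite this article:	WEI Li-xia, ZHENG Jia-heng. Web page deduplication based on Web page text structure [J]. Journal of Computer Applications (计算机应用), 2007, 27(11): 2854-2856.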

Authors:	WEI Li-xia (魏丽霞), ZHENG Jia-heng (郑家恒)

Affiliation:	School of Computer and Information Technology, Shanxi University, Taiyuan 030006

Funding:	National Natural Science Foundation of China; Natural Science Foundation of Shanxi Province

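Abstract:	Duplicate Web pages returned by search engines not only waste storage resources but also add to the burden on users browsing the results. Based on the characteristics of duplicate Web pages and of Web page text itself, a dynamic method for removing duplicate Web pages is proposed. The method represents the main text of a Web page as a catalogue structure tree and, on that basis, implements a dynamic feature extraction algorithm and a layer fingerprint algorithm for computing similarity. Experiments show that the method accurately detects both fully duplicated and partially duplicated Web pages.
Keywords:	layer fingerprint; text structure; Web page deduplication
Article ID:	1001-9081(2007)11-2854-03
Received:	2007-05-28
Revised:	2007-05-28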
Detection and elimination of similar Web pages based on text structure

WEI Li-xia, ZHENG Jia-heng. Detection and elimination of similar Web pages based on text structure [J]. Journal of Computer Applications, 2007, 27(11): 2854-2856.

Authors:	WEI Li-xia, ZHENG Jia-heng

Abstract:	Similar Web pages returned by search engines not only waste storage resources but also increase the burden on Web users. A dynamic method to detect similar Web pages is proposed, based on the features of similar Web pages and of Web page text itself. The method represents the text of a Web page as a catalogue structure tree, and implements a dynamic algorithm to extract text features and a layer fingerprint algorithm to calculate the degree of similarity. Experimental results show that both completely similar and partly similar Web pages are detected accurately.

Keywords:	layer fingerprint; text structure; detection and elimination of similar Web pages
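
The abstract does not say how the catalogue structure tree or the layer fingerprint are actually built, so the following is only a rough illustration of the general idea and not the authors' algorithm. It assumes a page's text has already been parsed into a tree of sections; the names `Node`, `fingerprint` and `similarity`, the use of SHA-256, and the greedy child matching are all hypothetical choices made for this sketch.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a section with its own text and sub-sections."""
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def _normalize(text: str) -> str:
    # Assumption: whitespace differences should not affect the comparison.
    return "".join(text.split())


def fingerprint(node: Node) -> str:
    """Hash of the node's normalized text combined with its children's hashes."""
    h = hashlib.sha256()
    h.update(_normalize(node.text).encode("utf-8"))
    for child in node.children:
        h.update(fingerprint(child).encode("ascii"))
    return h.hexdigest()


def similarity(a: Node, b: Node) -> float:
    """Top-down comparison: equal fingerprints score 1.0; otherwise descend.

    Each child of `a` is greedily paired with its best-scoring child of `b`,
    so the score is a heuristic and is not guaranteed to be symmetric.
    """
    if fingerprint(a) == fingerprint(b):
        return 1.0
    own = 1.0 if _normalize(a.text) == _normalize(b.text) else 0.0
    matched = sum(
        max((similarity(ca, cb) for cb in b.children), default=0.0)
        for ca in a.children
    )
    return (own + matched) / (1 + max(len(a.children), len(b.children)))


# Toy pages: page_b repeats page_a up to whitespace, page_c shares one section.
page_a = Node(children=[Node("Intro text"), Node("Body text"), Node("Footer")])
page_b = Node(children=[Node("Intro  text"), Node("Body text"), Node("Footer")])
page_c = Node(children=[Node("Intro text"), Node("Different body")])
```

In this sketch a fully duplicated page is recognized at the root from a single fingerprint comparison, while a partially duplicated page is scored by descending into the tree only where fingerprints differ. A real system would cache fingerprints rather than recompute them on every call.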
This article is indexed in databases including VIP and Wanfang Data.