
Research on Web Page Deduplication Methods (网页去重方法研究)

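Cite this article:	FAN Yong, ZHENG Jia-heng. Research on web page deduplication methods[J]. Computer Engineering and Applications (计算机工程与应用), 2009, 45(12): 141-143.

Authors:	FAN Yong (樊勇), ZHENG Jia-heng (郑家恒)

Affiliations:	1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing (Ministry of Education, jointly established with the province), Taiyuan 030006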

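Abstract:	Duplicate web pages returned by search engines not only waste storage resources but also add to the user's browsing burden. Based on the characteristics of duplicated web pages, a semantics-based deduplication method is proposed. The method extracts a topic sentence vector from the main text of a web page using the position of each sentence in the text and the importance of its chunks, then computes the semantic similarity between topic sentence vectors and removes the duplicate pages. Experiments show that the method detects both fully duplicated and partially duplicated web pages with good accuracy.
Keywords:	chunk; topic sentence vector; web page deduplication
Received:	2008-3-6
Revised:	2008-5-26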
Research on elimination of similar web pages

FAN Yong, ZHENG Jia-heng. Research on elimination of similar web pages[J]. Computer Engineering and Applications, 2009, 45(12): 141-143.

Authors:	FAN Yong, ZHENG Jia-heng

Affiliation:	FAN Yong (1), ZHENG Jia-heng (2). 1. Department of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; 2. Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, China

Abstract:	Similar web pages returned by a search engine not only waste storage resources but also increase the burden on web users. In this paper, a semantics-based method for detecting similar web pages is proposed. The method extracts the topic sentence vector of a web page using the location of each sentence in the text and the importance of chunks. It then detects similar web pages by calculating the semantic similarity of the topic sentence vectors. The experiment results show that not only completely similar web pages are detect...

Keywords:	chunking; topic sentence vector; elimination of similar web pages
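
The pipeline outlined in the abstract (pick topic sentences, compare them, drop pages that match) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formulas, so the position weight, the term-frequency proxy for chunk importance, the Jaccard overlap used in place of semantic similarity, and all function names and the `threshold` value are hypothetical. The tokenizer also assumes whitespace-separated text; Chinese pages would need a real word segmenter and chunker.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences on common Chinese and English terminators."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercased word tokens; a stand-in for real chunking (assumption)."""
    return re.findall(r"\w+", sentence.lower())


def topic_sentences(text, k=3):
    """Pick the k highest-scoring sentences as a rough topic sentence vector.

    Score = position weight * average term frequency of the sentence's tokens.
    Both factors are illustrative assumptions, not the paper's formulas.
    """
    sentences = split_sentences(text)
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    scored = []
    for i, s in enumerate(sentences):
        toks = tokenize(s)
        if not toks:
            continue
        # Assumed heuristic: the first and last sentences weigh more.
        position = 1.5 if i in (0, len(sentences) - 1) else 1.0
        importance = sum(freq[t] for t in toks) / len(toks)
        scored.append((position * importance, i, s))
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
    # Return the selected sentences in their original document order.
    return [s for _, _, s in sorted(top, key=lambda x: x[1])]


def sentence_similarity(a, b):
    """Jaccard overlap of token sets; a placeholder for semantic similarity."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def page_similarity(text_a, text_b, k=3):
    """Average best-match score of A's topic sentences against B's (directional)."""
    va, vb = topic_sentences(text_a, k), topic_sentences(text_b, k)
    if not va or not vb:
        return 0.0
    return sum(max(sentence_similarity(a, b) for b in vb) for a in va) / len(va)


def deduplicate(pages, threshold=0.8, k=3):
    """Keep a page only if it is not too similar to any page already kept."""
    kept = []
    for page in pages:
        if all(page_similarity(page, other, k) < threshold for other in kept):
            kept.append(page)
    return kept
```

Calling `deduplicate(list_of_page_texts)` returns the pages that survive the filter, in input order. Comparing only a handful of topic sentences per page, rather than full texts, is what would let such a scheme catch partially duplicated pages as well as exact copies; lowering `threshold` makes the filter more aggressive.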
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.